generate an idx file for a vcf


I have a vcf file that does not have an associated index file. Is it possible to create one? If yes, what module do I need to use?

Thanks for your input!



    We use tabix ( http://www.htslib.org/doc/tabix.html ) from Samtools, on the command line, naturally.

    Does a spec exist for the index file format?

    From what I've been able to find, the .vcf.idx file that UnifiedGenotyper creates is a different format than tabix. Hexdump reveals the .vcf.idx file has a magic number TIDX (54 49 44 58) while the tabix spec says it has TBI\1 (http://samtools.github.io/hts-specs/tabix.pdf). Not clear to me what the \1 is supposed to mean, but these obviously are different files. I did find some discussions that seem to indicate a tabix file can be used with GATK, but I'm not sure that's universally true so I'd like to stick with .vcf.idx.
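For anyone wanting to repeat the hexdump check above without reading raw bytes by eye, here is a minimal sketch. It assumes only what the post and the tabix spec state: a Tribble-style .idx file starts with the ASCII bytes TIDX, and a .tbi file is a gzip-compatible (BGZF) container whose decompressed payload starts with TBI followed by a single byte of value 1 (which is what the \1 in the spec denotes). The function name `index_kind` is made up for illustration.

```python
import gzip


def index_kind(path):
    """Classify an index file by its magic bytes: 'tribble', 'tabix' or 'unknown'."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    if head == b"TIDX":
        # Uncompressed .idx, as written for plain-text VCFs
        return "tribble"
    if head[:2] == b"\x1f\x8b":
        # gzip/BGZF container: the tabix magic is only visible after decompression
        try:
            with gzip.open(path, "rb") as fh:
                if fh.read(4) == b"TBI\x01":
                    return "tabix"
        except (OSError, EOFError):
            # Not a readable gzip stream after all
            pass
    return "unknown"
```

Calling `index_kind("calls.vcf.idx")` on a GATK-written index should give `"tribble"`, while a tabix-built `calls.vcf.gz.tbi` should give `"tabix"` (both file names are placeholders).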

    Why I am asking. In my use case I mapped reads to a reference with ≈25,000 sequences. I'm only interested in calling variants on one of those sequences (call it my bullseye). I included the others during mapping so that reads that map ambiguously to the bullseye and elsewhere will not be considered a good mapping on the bullseye.

    Now I am into the genotyping stage, but only genotyping on intervals on the bullseye. I've processed this using my own scatter/gather (all intervals are on the bullseye), except that every one of my vcf files has a header containing 25,000 useless lines. I'd like to whittle the non-bullseye lines out of the header for my downstream processing. That's easy enough to do with a shell script, but then I would need to rebuild the index file if I want to use these downstream.
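The header trimming described above could be sketched roughly as follows, as an alternative to a shell script. This is an illustration only: the function name `keep_contig_header` and the contig name `bullseye` are placeholders, and the code assumes the usual `##contig=<ID=...,length=...>` header layout. Because removing lines shifts every byte offset in the file, any existing .idx no longer matches and has to be rebuilt, as the post says.

```python
import re

# Matches the ID value inside a ##contig=<...> header line
_CONTIG_ID = re.compile(r"[<,]ID=([^,>]+)")


def keep_contig_header(lines, keep):
    """Yield VCF lines, dropping ##contig header entries whose ID is not in `keep`."""
    for line in lines:
        if line.startswith("##contig=<"):
            match = _CONTIG_ID.search(line)
            # Lines with no recognisable ID are passed through untouched
            if match and match.group(1) not in keep:
                continue
        yield line
```

A typical use would be to open the original VCF for reading and a new file for writing, then call `out.writelines(keep_contig_header(src, {"bullseye"}))`.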

    If you run any GATK tool on a vcf that doesn't have an index, it will automatically generate one for you, so knock yourself out :)

    How do we do that if we are providing a file that was not generated by GATK?


    I have a vcf file, but it does not seem to conform to the standard VCF format. I am unable to run tabix on it. Any advice?
    Also, it seems I am not able to upload it here. Any help?


    Hi @ibseq,

    In what way is your vcf file unconventional? When you run ValidateVariants on your VCF, what does the message say?

    I believe the .idx index is for uncompressed VCFs. The .tbi tabix index is for block-compressed .gz VCFs. Be sure you are tabix indexing a block-compressed vcf.

    To upload here, just change the .vcf extension (of the uncompressed file) to .txt. You can do this on a Mac by right-clicking on a file, then selecting Get Info.
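On the point about block compression: a .gz file can be ordinary gzip rather than BGZF, and tabix can only index the latter. Below is a rough sketch of a check, based on the BGZF layout in the SAM specification (a gzip member with the FEXTRA flag set and a `BC` extra subfield of length 2). The function name `is_bgzf` is hypothetical, and it only inspects the first block of the file.

```python
import struct


def is_bgzf(path):
    """Return True if the file's first gzip member carries the BGZF 'BC' extra subfield."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        # gzip magic (1f 8b), deflate method (08), and the FEXTRA flag bit (0x04)
        if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)
    pos = 0
    # Walk the extra-field subfields: SI1, SI2, SLEN, then SLEN bytes of data
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False
```

If this returns False for a `.vcf.gz`, one option is to decompress it and recompress with bgzip before running tabix.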

    It does not seem to work.

    I used tabix to index Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, resulting in a file named Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

    But for 1000G_phase1.indels.b37.vcf.gz (downloaded from ftp.broadinstitute.org ), it said tbx_index_build failed: 1000G_phase1.indels.b37.vcf.gz

    When I run gatk (version 3.7-0-gcfedb67) with /scratch/1000G_phase1.indels.b37.vcf.gz, it said ERROR MESSAGE: An index is required, but none found., for input source: /scratch/1000G_phase1.indels.b37.vcf.gz

    The cmd is gatk -T RealignerTargetCreator -fixMisencodedQuals -nt $ncore --known /scratch/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known /scratch/1000G_phase1.indels.b37.vcf.gz -R /scratch/index_genome/hg38.fa -I ../results/bam_bwa/H_sorted_nodup.bam -o H_indelRealigner.intervals

    Thank you.

    Hi @vz33,

    So you are showing that tabix is giving errors for various files. Make sure you're using the latest version of the software. If you still get an error, you'll have to take this up with the Samtools folks.

    Not every one of our tools can index on the fly. There is a certain class of tools that process VCFs that can index on the fly. My guess is these types of tools are marked with tools.walkers.variantutils in their code. For example, to index a VCF on the fly using GATK, we recommend using ValidateVariants or SelectVariants and these tools are such variant walkers.

    Since it's been a while since this issue was first raised, I was wondering if there might be an update on how to generate a .idx file for a .vcf?

    This is not helpful, because the goal is to have the .idx pre-generated so that you do not have to wait until running a GATK job to generate it. For example, in my pipeline, I generate a .vcf from LoFreq, Strelka, Pindel, etc., and then want to run a number of different GATK tools on it in parallel. Every GATK tool I try to run now spends a significant amount of time generating the .idx file needed, sometimes on the order of many minutes, even if the underlying GATK command only actually takes a few seconds to execute. When you scale this out over large batches of samples being processed in many different parallel tasks, this ends up being many hours of compute time wasted generating the same file over and over again.

    Would be so much easier if I could just generate the .idx myself once and then pass it along with the .vcf in my pipeline to all the steps that require it.

    Is there a recommended way to do that?

    Yes — GATK4 includes a tool called IndexFeatureFile that can do this for you.
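For pipelines like the one described above, the index could be built once up front and then shipped with the VCF to every parallel step. A minimal sketch follows, with assumptions stated plainly: it presumes the `gatk` launcher script is on PATH and that the installed GATK4 release accepts the input as `-I` (some earlier GATK4 releases used `-F` instead, so check `gatk IndexFeatureFile --help`). The helper names are hypothetical.

```python
import shutil
import subprocess


def index_feature_file_cmd(vcf_path, gatk="gatk"):
    """Build the IndexFeatureFile command line (assumes the -I input flag)."""
    return [gatk, "IndexFeatureFile", "-I", vcf_path]


def build_index(vcf_path, gatk="gatk"):
    """Run IndexFeatureFile once so later GATK steps can reuse the index."""
    if shutil.which(gatk) is None:
        raise FileNotFoundError(f"{gatk!r} launcher not found on PATH")
    # check=True raises CalledProcessError if GATK exits with a non-zero status
    subprocess.run(index_feature_file_cmd(vcf_path, gatk), check=True)
```

Running `build_index("sample.vcf")` (a placeholder file name) right after the variant caller finishes means the downstream GATK jobs find the index already sitting next to the VCF.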

    @Geraldine_VdAuwera, does this tool do the same thing as the VCF index function in IGV?
    Thanks a lot.

    That link points back to this thread. Did you paste the wrong link? Thanks a lot, @bhanuGandham.

    Sorry, yes, I linked the wrong doc. Take a look at this for info on how IGV creates indices: https://software.broadinstitute.org/software/igv/igvtools_commandline

