generate an idx file for a vcf

Hello,

I have a vcf file that does not have an associated index file. Is it possible to create one? If yes, what module do I need to use?

Thanks for your input!

~Mika.

Answers

  • SteveL, Barcelona, Member ✭✭

    We use tabix ( http://www.htslib.org/doc/tabix.html ) from Samtools, on the command line, naturally.
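
A minimal sketch of the tabix route, wrapped in Python for illustration. It assumes `bgzip` and `tabix` from htslib are on the PATH; the function names and the `calls.vcf` file name are hypothetical. tabix needs a block-compressed file, so the sketch compresses first with `bgzip -c` and then indexes with `tabix -p vcf`.

```python
import shutil
import subprocess
from pathlib import Path


def tabix_index_commands(vcf_path):
    """Return the bgzip and tabix command lines for an uncompressed VCF."""
    vcf = Path(vcf_path)
    gz = vcf.with_name(vcf.name + ".gz")
    compress = ["bgzip", "-c", str(vcf)]      # writes compressed data to stdout
    index = ["tabix", "-p", "vcf", str(gz)]   # writes <gz>.tbi next to the file
    return compress, index, gz


def build_tabix_index(vcf_path):
    """Compress with bgzip, then index with tabix; returns the .tbi path."""
    for tool in ("bgzip", "tabix"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    compress, index, gz = tabix_index_commands(vcf_path)
    with open(gz, "wb") as out:
        subprocess.run(compress, stdout=out, check=True)
    subprocess.run(index, check=True)
    return Path(str(gz) + ".tbi")
```

Calling `build_tabix_index("calls.vcf")` would leave the original file in place and write `calls.vcf.gz` plus `calls.vcf.gz.tbi` beside it.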

  • BobHarris, earth, Member

    Does a spec exist for the index file format?

    From what I've been able to find, the .vcf.idx file that UnifiedGenotyper creates is a different format than tabix. Hexdump reveals the .vcf.idx file has a magic number TIDX (54 49 44 58) while the tabix spec says it has TBI\1 (http://samtools.github.io/hts-specs/tabix.pdf). Not clear to me what the \1 is supposed to mean, but these obviously are different files. I did find some discussions that seem to indicate a tabix file can be used with GATK, but I'm not sure that's universally true so I'd like to stick with .vcf.idx.
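
The two magic numbers mentioned above can be told apart programmatically. The sketch below is an illustration only, with a hypothetical function name and labels. It relies on two assumptions: that `\1` in the tabix spec is the C-style escape for a single byte of value 1, and that a `.tbi` file is itself BGZF (gzip-compatible) compressed, so its magic only appears after decompression, whereas the `TIDX` magic of the `.idx` file is visible in the raw bytes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def sniff_index_format(path):
    """Guess whether an index file is a .idx (TIDX) or a tabix (TBI\\1) index."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    if head == b"TIDX":
        return "tribble-idx"
    if head[:2] == GZIP_MAGIC:
        # Tabix indexes are compressed; look at the first decompressed bytes.
        with gzip.open(path, "rb") as fh:
            if fh.read(4) == b"TBI\x01":
                return "tabix"
    return "unknown"
```

This only inspects the first four bytes, so it identifies the format but says nothing about whether the index is valid or up to date.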

    Why I am asking: in my use case I mapped reads to a reference with ≈25,000 sequences. I'm only interested in calling variants on one of those sequences (call it my bullseye). I included the others during mapping so that reads that map ambiguously to the bullseye and elsewhere will not be considered a good mapping on the bullseye.

    Now I am into the genotyping stage, but only genotyping on intervals on the bullseye. I've processed this using my own scatter/gather (all intervals are on the bullseye), except that every one of my vcf files has a header containing 25,000 useless lines. I'd like to whittle the non-bullseye lines out of the header for my downstream processing. That's easy enough to do with a shell script, but then I would need to rebuild the index file if I want to use these downstream.
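
The header trimming described above could be sketched as follows. This is an illustration, not the poster's script: the function name is hypothetical, and it assumes the unwanted lines are the standard `##contig=<ID=...>` header lines. Every other line, including the records, passes through unchanged. Because the edit shifts byte offsets in the file, any existing index would need to be regenerated afterwards.

```python
import re

# Captures the ID value of a ##contig header line, wherever ID appears in it.
CONTIG_ID = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def keep_only_contig(lines, contig):
    """Yield VCF lines, dropping ##contig header lines for other sequences."""
    for line in lines:
        match = CONTIG_ID.match(line)
        if match and match.group(1) != contig:
            continue  # a contig line for a sequence we do not care about
        yield line
```

For a file on disk this could be used as `out.writelines(keep_only_contig(src, "bullseye"))`, where `src` and `out` are open text file handles.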

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin

    If you run any GATK tool on a vcf that doesn't have an index, it will automatically generate one for you, so knock yourself out :)

  • ibseq, United Kingdom, Member

    @Geraldine_VdAuwera said:
    If you run any GATK tool on a vcf that doesn't have an index, it will automatically generate one for you, so knock yourself out :)

    How do we do that if we are providing a file that was not generated by GATK?

    thanks,
    ibseq

  • ibseq, United Kingdom, Member

    @Geraldine_VdAuwera said:
    Or you can just run a GATK job on it and GATK will auto-generate an index for your vcf :)

    Hi,
    I have a vcf file, but it does not seem to conform to the standard VCF format. I am unable to run tabix on it. Any advice?
    Also, it seems I am not able to upload it here. Any help?

    thanks
    ibseq

  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭

    Hi @ibseq,

    In what way is your vcf file unconventional? When you run ValidateVariants on your VCF, what does the message say?

    I believe the .idx index is for uncompressed VCFs. The .tbi tabix index is for block-compressed .gz VCFs. Be sure you are tabix indexing a block-compressed vcf.
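
One way to check the block-compression point before running tabix is to inspect the gzip header. The sketch below is an illustration with a hypothetical function name. It assumes the BGZF layout from the SAM specification: a gzip member with the FEXTRA flag set and an extra subfield identified by the bytes `B`, `C` with a payload length of 2. A file made with plain `gzip` lacks that subfield.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block, not plain gzip."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
            "<BBBBIBBH", header
        )
        # gzip magic, deflate method, and the FEXTRA flag (bit 2) must be set.
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            return False
        extra = fh.read(xlen)
    # Walk the extra subfields looking for the BGZF 'BC' marker.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False
```

If this returns False for a `.vcf.gz`, decompressing it and recompressing with `bgzip` before indexing is the usual remedy.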

    To upload here, just change the .vcf extension (of the uncompressed file) to .txt. You can do this on a Mac by right-clicking on a file, then selecting Get Info.

  • vz33, beijing, Member

    @Geraldine_VdAuwera said:
    Or you can just run a GATK job on it and GATK will auto-generate an index for your vcf :)

    It does not seem to work.

    I used tabix to index Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, resulting in a file named Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

    But for 1000G_phase1.indels.b37.vcf.gz (downloaded from ftp.broadinstitute.org ), it said tbx_index_build failed: 1000G_phase1.indels.b37.vcf.gz

    When I ran gatk (version 3.7-0-gcfedb67) with /scratch/1000G_phase1.indels.b37.vcf.gz, it reported: ERROR MESSAGE: An index is required, but none found., for input source: /scratch/1000G_phase1.indels.b37.vcf.gz

    The command is: gatk -T RealignerTargetCreator -fixMisencodedQuals -nt $ncore --known /scratch/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known /scratch/1000G_phase1.indels.b37.vcf.gz -R /scratch/index_genome/hg38.fa -I ../results/bam_bwa/H_sorted_nodup.bam -o H_indelRealigner.intervals

    Thank you.

  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭

    Hi @vz33,

    So you are showing that tabix is giving errors for various files. Make sure you're using the latest version of the software. If you still get an error, you'll have to take this up with the Samtools folks.

    Not every one of our tools can index on the fly. Only a certain class of tools that process VCFs can do so. My guess is that these tools are marked with tools.walkers.variantutils in their code. For example, to index a VCF on the fly using GATK, we recommend ValidateVariants or SelectVariants, both of which are variant walkers of this kind.
