
GenotypeGVCFs WARN Track variant doesn't have a sequence dictionary built in

Hi Team,
I'm getting `WARN  21:19:30,478 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation` when processing gzipped g.vcf files produced by HaplotypeCaller (via -o foo.g.vcf.gz, as suggested by @Geraldine_VdAuwera in blog post 3893) with GenotypeGVCFs.
This results in dramatic increases in run time (which makes sense if GenotypeGVCFs decompresses the files) and in memory requirements (why??) for GenotypeGVCFs, compared to processing the gVCFs for the same BAM files when the HC output files are uncompressed. Most batches that previously completed with 4x8GB RAM now produce `java.lang.OutOfMemoryError: Java heap space` errors even with 4x64GB!

Could you please advise whether this warning is expected behaviour? If yes, what exactly is missing (can't see much difference in unzipped vs gzipped vcf headers), and can this be added somehow?
Issue · Github
by Geraldine_VdAuwera

Issue Number: 893
State: closed
Closed By: vdauwera

Answers

  • KlausNZ Member ✭✭

    Sorry for the formatting - Chrome does not recover gracefully from 'Preview'!
    All done with version 3.3.
    Thanks in advance!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @KlausNZ,

    GVCFs require a special indexing scheme to deal with the large number of records. It's possible that this is not being managed properly when using gzipped files. I'll put in a ticket to get this looked at, but in the meantime my recommendation is to not gzip the GVCFs.

  • KlausNZ Member ✭✭

    Hi Geraldine, many thanks for answering this. Just to clarify: GenotypeGVCFs issues the warning in both cases, i.e. with GVCFs compressed outside GATK with gzip+tabix (as per post 3893) OR compressed by HaplotypeCaller (-o foo.g.vcf.gz). But: GenotypeGVCFs runtimes are shorter in the second case, although still longer than uncompressed.
    All good, many thanks for the great tools!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Thanks for the clarification, that makes intuitive sense -- and might prove useful info if we ever move to address this. (But don't hold your breath).

  • pdexheimer Member ✭✭✭✭

    So for whatever it's worth, I gzip all of my gVCFs through GATK and haven't encountered any major issues - though I haven't carefully evaluated the effect on runtime.

    I did run into the OutOfMemory error yesterday. In this case, I hadn't gone through the CombineGVCFs process in a while and was trying to GGVCF ~400 files. However, combining into aggregate gVCFs of 100 samples each solved the problem - with 932 samples in 41 files, everything is running smoothly again.

  • KlausNZ Member ✭✭

    @pdexheimer, Many thanks, this is useful to know. I have since re-created all g.vcfs with HaplotypeCaller (same scripts, same bam files, same GATK version), and can confirm that runtime issues have disappeared with the re-created g.vcfs.

    Run time problems were much less pronounced with the files gzipped by HC than with gzip+tabix, which may explain why you haven't noticed these (alternatively, the problems are specific to our hardware) - but it also indicates that HC-compressed files differ from those produced by gzip+tabix?
    It's not a big problem in practice (breathing normally ;-), thanks again for all your help!

  • pdexheimer Member ✭✭✭✭

    Just to clarify, are you using gzip or bgzip to compress?

  • KlausNZ Member ✭✭

    I asked myself the same question that night. I had to load tabix just to access bgzip, but wondered whether I had accidentally gzipped one or more of the files in the after-midnight compression frenzy. I decided that this did not happen because 1) tabix terminates with an error when operating on gzipped files ([tabix] was bgzip used to compress this file?), and 2) I had *.tbi files for all my *.g.vcf.gz files, indicating that tabix completed successfully for each file.

    Unfortunately the commands are no longer in my history, and I have deleted the compressed files, so not sure how to conclusively exclude a possible error.

    If you think that the presence of at least one gzipped input file for GenotypeGVCFs would be a plausible explanation of my problems, I'll be happy to experiment a little to see whether we can move this from the 'issues' to the 'operator error' list.

    Many thanks for your thoughts on this!
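The gzip-versus-bgzip question raised above can also be settled by inspecting the file itself. The following is a minimal sketch, assuming the BGZF layout described in the SAM/BAM specification (a gzip member whose extra field carries a `BC` subfield); the function name `is_bgzf` is hypothetical and not part of GATK or tabix.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block, False for plain gzip or anything else."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header: magic bytes, method, flags, mtime, xfl, os, then XLEN.
        id1, id2, method, flags, _mtime, _xfl, _os, xlen = struct.unpack("<4BI2BH", header)
        # BGZF requires the gzip magic, deflate (8), and the FEXTRA flag (0x04).
        if (id1, id2, method) != (0x1F, 0x8B, 8) or not flags & 0x04:
            return False
        extra = handle.read(xlen)

    # Walk the extra subfields looking for 'B', 'C' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<2BH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

Running this over every `*.g.vcf.gz` input before GenotypeGVCFs would flag any file that was compressed with plain gzip by mistake. It only looks at the first block, so it says nothing about whether the index matches the file.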

  • pdexheimer Member ✭✭✭✭

    Well, it seemed plausible, but I didn't know what the error mode is for trying to tabix a gzipped file. But if tabix fails altogether rather than just silently degrading performance, it doesn't seem like it would explain what you saw

  • KlausNZ Member ✭✭

    Yes, seemed plausible to me, too ;-)

  • bbimber Home Member

    Hello - I know there's a bookmarked answer, but I wanted to clarify one point. Are you saying this issue is specific to tabix, meaning that gzipped gVCFs indexed via GATK would not have this problem?

  • KlausNZ Member ✭✭

    Yes - I now routinely output g.vcf.gz from HC (hundreds since my original post) and they pose no problem for GenotypeGVCFs.
    If I want to inspect one of these files, I make a copy, decompress, inspect, and discard it. I haven't used GenotypeGVCFs 3.4-0 yet, so the issue may be gone now.
    In short - it's a great tool!
    Have fun!
    K
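For a quick look at the header alone, a full decompressed copy may not be needed. This is a minimal sketch using only the Python standard library, assuming the file is either plain text or gzip/bgzip compressed (bgzip output is a series of gzip members, which the `gzip` module reads transparently); `read_vcf_header` is a hypothetical helper name.

```python
import gzip


def read_vcf_header(path):
    """Return the header lines (those starting with '#') of a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    header = []
    with opener(path, "rt") as handle:
        for line in handle:
            # The header ends at the first line that is not a '#' line.
            if not line.startswith("#"):
                break
            header.append(line.rstrip("\n"))
    return header
```

Because reading stops at the first record, only the start of the file is decompressed, which keeps this cheap even for very large gVCFs.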

  • vsvinti Member ✭✭

    Same problem here with bgzip and tabix applied to vcf files -- with gatk 3.4. Trying to compress during CombineGVCFs to see if it makes a difference.

  • vsvinti Member ✭✭

    Yes, same here: it works if the compression is done with GATK, but not with bgzip or tabix, for GATK 3.4.

  • zzq China Member
    edited March 2016

    Hi @Geraldine_VdAuwera @Sheila

    I am getting the same warning (GATK v3.5), and I do not know whether it affects the results. I have more than 100 samples and ran HaplotypeCaller on each sample with the --emitRefConfidence GVCF option. The resulting g.vcf files were so large that I compressed them using bgzip and tabix. But when I run CombineGVCFs, it gives me the warning above.

    Also, because there are so many gVCF files, I combine 20 at a time, and at the end I will combine those combined gVCFs again to get the final result. I think this will save a lot of time, but I am curious about the differences between this method and combining all gVCF files at once.

    I hope you can help me with the above questions.

    Thanks

    best wishes

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @zzq As discussed in this thread, you can avoid this problem by having the GATK tools themselves emit gzipped files. We can't help you deal with files that were gzipped differently after the fact, sorry. And we have not tried to do this so we don't know what the consequences might be. It's probably ok if it runs to completion but we can't guarantee that.

  • zzq China Member

    Hi @Geraldine_VdAuwera

    Many thanks, but for CombineGVCFs, can I split the gVCF files into small batches, combine each batch, and then combine those combined gVCFs again to get the final result?

    Many thanks.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Yes, you can combine batches and then combine the batches again.
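The batch-then-combine approach confirmed above can be planned with a small script. This is a minimal sketch that only assembles GATK3-style CombineGVCFs argument lists and does not run anything; the jar name `GenomeAnalysisTK.jar`, the file names, and the helper names are assumptions for illustration, and the batch size is whatever suits your memory (the thread mentions batches of 20 and of 100).

```python
def plan_combine_batches(gvcf_paths, batch_size):
    """Split a list of gVCF paths into consecutive batches of at most batch_size files."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [gvcf_paths[i:i + batch_size] for i in range(0, len(gvcf_paths), batch_size)]


def combine_gvcfs_args(reference, inputs, output, jar="GenomeAnalysisTK.jar"):
    """Build a GATK3-style CombineGVCFs command as an argument list (jar path is an assumption)."""
    args = ["java", "-jar", jar, "-T", "CombineGVCFs", "-R", reference]
    for path in inputs:
        args += ["--variant", path]
    args += ["-o", output]
    return args


if __name__ == "__main__":
    # Hypothetical inputs: 45 per-sample gVCFs combined 20 at a time.
    samples = ["sample_%03d.g.vcf.gz" % n for n in range(45)]
    for index, batch in enumerate(plan_combine_batches(samples, 20)):
        print(" ".join(combine_gvcfs_args("ref.fasta", batch, "batch_%02d.g.vcf.gz" % index)))
```

The per-batch outputs can then be fed through `combine_gvcfs_args` once more to produce the final combined file. Writing the outputs with a `.g.vcf.gz` extension follows the advice in this thread to let GATK do the compression itself.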

  • zzq China Member

    Hi @Geraldine_VdAuwera ,
    Many thanks. You said I can avoid this problem by having the GATK tools themselves emit gzipped files, but I got the same warnings from CombineGVCFs (GATK v3.5) when I used .g.vcf.gz / .g.vcf.gz.tbi inputs (generated by HaplotypeCaller, run with -o .g.vcf.gz and --emitRefConfidence BP_RESOLUTION). And, after seeing the post at [http://gatkforums.broadinstitute.org/gatk/discussion/6447/gzipped-gvcf-files], I am confused about why this warning appears whether the g.vcf file was compressed by bgzip or by GATK.

    many thanks.

    Issue · Github
    by Sheila

    Issue Number: 696
    State: closed
    Closed By: vdauwera
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @zzq, my apologies -- I did not explain the expected behavior correctly in my previous comment. What I should have said was that if you do the gzipping directly through GATK, we are confident that the results will be correct and so you can ignore the warnings safely. However, the warnings themselves will not go away because we have not solved the technical cause of the warning. Also, any performance cost (typically it will proceed more slowly) remains.

  • gaelgarcia Member
    edited May 2018

    Hi all,

    I realize this is an old thread, but I'm encountering the same problem when running VariantsToBinaryPed on my (5GB) `.vcf.bgz` file (+ `.vcf.bgz.tbi`). I inherited these files, so I can only assume the VCF was compressed and indexed using `tabix` in order to generate this fileset.

    If I understand correctly from the above, the three options I have to overcome this error are:

    a) Re-run HaplotypeCaller, instructing GATK to output a compressed file. (Will this generate an index file, too?) [2]

    b) Run VariantsToBinaryPed on the uncompressed .vcf. -- I have unzipped it and it is 50GB, but how do I generate the accompanying index, which VariantsToBinaryPed requires? [1]

    c) Recompress with gzip as .gz (without tabix) and then run the VariantsToBinaryPed tool. Again, here I'm unsure as to how to generate the accompanying index, which VariantsToBinaryPed requires. [3]

    Thank you.

    1.

    It turns out this is a limitation due to the use of Tabix indexing for gzipped vcfs. This is not something we can address at this time, so I'm afraid you're stuck with unzipped gvcfs for now.

    2.

    [...] you can avoid this problem by having the GATK tools themselves emit gzipped files.

    3.

    [...] this issue is specific to tabix, meaning that gzipped gVCFs indexed via GATK would not have this problem

  • Sheila Broad Institute Member, Broadie admin

    @gaelgarcia
    Hi,

    a) Yes, and a VCF index will be produced as well.

    b) The tools will automatically generate an index for the VCF.

    c) You will need to use Tabix for both.

    -Sheila

  • tedtoal Member

    I used GATK v4 IndexFeatureFile, followed by GATK3 SelectVariants. SelectVariants reported the warning discussed above:

    IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation

    According to the discussion above, I expected that using a GATK tool (IndexFeatureFile) to index a .vcf.gz file would eliminate this warning.

    What gives?

  • Sheila Broad Institute Member, Broadie admin

    @tedtoal
    Hi,

    Did the tool run to completion? Which version of GATK4 are you using?

    -Sheila

  • tedtoal Member

    Yes, both IndexFeatureFile and SelectVariants ran to completion. I'm using GATK3 version 3.6.0 (SelectVariants) and GATK4 version 4.0.6.0 (IndexFeatureFile).

  • Sheila Broad Institute Member, Broadie admin

    @tedtoal
    Hi,

    I am not sure why the WARN statement is being output. Does this happen with uncompressed files too?

    -Sheila

  • tedtoal Member
    edited August 2018

    No. With an uncompressed VCF file, it gives this error:

    ##### ERROR MESSAGE: Contig chr1 does not have a length field.
    

    I think I reported this already in a different post in a different thread.

    The ##contig statement for chr1 in the VCF file is:

    ##contig=<ID=chr1>
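
The missing `length` field behind that error can be spotted before running a tool. This is a minimal sketch, assuming header lines have already been read into a list of strings; `contigs_missing_length` is a hypothetical helper, the comma split is naive (it would mis-handle quoted values containing commas), and it is not how GATK itself validates headers.

```python
import re

# Matches VCF contig header lines such as ##contig=<ID=chr1,length=1000>
CONTIG_LINE = re.compile(r"^##contig=<(.*)>\s*$")


def contigs_missing_length(header_lines):
    """Return the IDs of ##contig header lines that carry no length field."""
    missing = []
    for line in header_lines:
        match = CONTIG_LINE.match(line)
        if not match:
            continue
        # Naive key=value parsing of the content between the angle brackets.
        fields = dict(
            item.split("=", 1) for item in match.group(1).split(",") if "=" in item
        )
        if "length" not in fields:
            missing.append(fields.get("ID", "?"))
    return missing
```

For the header line quoted above, this would report `chr1`. If a header has no `##contig` lines at all, the function returns an empty list, so that case needs a separate check.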
    
  • Sheila Broad Institute Member, Broadie admin

    @tedtoal
    Hi,

    I see. I will respond in that thread.

    -Sheila

  • MehulS Member

    I'm getting the same warning, but when using the --dbSNP option in GenotypeGVCFs:

    09:52:11.670 WARN IndexUtils - Feature file "/directory/dbSNP.vcf " appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file

  • bshifaw Member, Broadie, Moderator admin

    Hi @MehulS ,

    The previous posts in this thread suggest using GATK to create an index file for the VCF. Have you tried generating an index file for dbSNP.vcf using IndexFeatureFile?

    Also you may find this thread post regarding vcf stringency in GATK helpful.

  • Linda Member
    @bshifaw I get the same warning, shown below, when using --known-sites dbsnp_146.hg38.vcf in BaseRecalibrator. I downloaded dbsnp_146.hg38.vcf.gz from the GATK bundle, uncompressed it, and built the index with IndexFeatureFile from GATK 4.1.1.0. The log file shows the warning message but no error. I am not sure whether the warning will affect my results or how to resolve it.

    "05:08:35.360 WARN IndexUtils - Feature file "/data2/gminix/project_new/fangling/database/gatk_bundle/dbsnp_146.hg38.vcf" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file"
  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin
    edited April 9

    Hi @Linda

    Warnings should not affect your results. Unless there is an error, you are good.
