SelectVariants V4 TribbleException Contig chr1 does not have a length field

I indexed my VCF file with GATK V4.0.6.0 IndexFeatureFile, then ran GATK V4.0.6.0 SelectVariants on it, and I got an exception:

htsjdk.tribble.TribbleException: Contig chr1 does not have a length field.

When I run the same VCF using GATK V3 SelectVariants, it works.

As far as I know, ##contig entries in the VCF header should NOT have a length in them.

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @tedtoal
    Hi,

    Hmm. Can you check if this happens without .gz files?

    Thanks,
    Sheila

  • tedtoal (Member)

    Yes, it does. And when I run GATK4 ValidateVariants on the vcf file, I also get the same exception. When I run GATK3 ValidateVariants on it, I don't get an exception but I do get this error:

    ERROR MESSAGE: Contig chr1 does not have a length field.

    chr1 is the first contig line in the vcf, and it looks like this:

    ##contig=<ID=chr1>

    I would think the tool could get the chr1 length from the reference.

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @tedtoal
    Hi,

    I see. Did you generate the VCF with GATK tools?

    -Sheila

  • tedtoal (Member)

    No, it was generated with multiSNV.

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @tedtoal
    Hi,

    Our response would be that we don't support VCFs not generated by HaplotypeCaller, but I am curious why GATK3 ValidateVariants catches this and GATK4 does not.

    Can you try adding in length and assembly? For example, ##contig=<ID=20,length=63025520,assembly=b37>. The GATK tools do seem to require that information for the header lines and cannot take it from the reference. I think it is part of the validation.

    -Sheila

  • tedtoal (Member)

    Oh gosh...

    The 2018 VCF spec says the contig field TYPICALLY includes the length, but it doesn't say it is mandatory. It doesn't even mention an assembly ID!

    Where would I get the assembly ID if I'm using a reference genome I built myself?

    I've never had this problem before, and I thought I had used GATK tools with VCF files from other callers including multiSNV.

  • tedtoal (Member)

    Why does GATK3 work ok on a compressed vcf but not on a plain vcf? And why doesn't GATK4 behave the same way?

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @tedtoal
    Hi,

    "Where would I get the assembly ID if I'm using a reference genome I built myself? Why does GATK3 work ok on a compressed vcf but not on a plain vcf? And why doesn't GATK4 behave the same way?"

    Can you try making one up? I will check with the team why this is happening if it did not happen before.

    -Sheila

  • tedtoal (Member)

    I have a feeling that I don't need the assembly ID, and that the length will be sufficient. I've modified my VCF files to include the length and will see how it goes.
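
A minimal sketch of the workaround described above: adding a length attribute to ##contig header lines that lack one. It assumes the lengths come from a samtools-style .fai index of the same reference (tab-separated, contig name in column 1 and length in column 2) and that the VCF is uncompressed. The function names and file paths are hypothetical and are not part of GATK.

```python
import re

# Matches a ##contig header line and captures the value of its ID attribute.
CONTIG_ID = re.compile(r"^##contig=<(?:[^>]*,)?ID=([^,>]+)")
# Detects an existing length attribute on a header line.
HAS_LENGTH = re.compile(r"[<,]length=\d+")


def parse_fai(fai_text):
    """Map contig name -> length from the text of a .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def add_contig_lengths(lines, lengths):
    """Return header lines with length=N added to length-less ##contig lines.

    Lines must not carry a trailing newline. Contigs that are absent from
    `lengths` are left untouched rather than guessed.
    """
    fixed = []
    for line in lines:
        match = CONTIG_ID.match(line)
        if (match and match.group(1) in lengths and line.endswith(">")
                and not HAS_LENGTH.search(line)):
            line = f"{line[:-1]},length={lengths[match.group(1)]}>"
        fixed.append(line)
    return fixed


def fix_vcf(vcf_path, fai_path, out_path):
    """Copy an uncompressed VCF, patching only its ## header lines."""
    with open(fai_path) as fai:
        lengths = parse_fai(fai.read())
    with open(vcf_path) as src, open(out_path, "w") as dst:
        for line in src:
            body = line.rstrip("\n")
            if body.startswith("##"):
                body = add_contig_lengths([body], lengths)[0]
            dst.write(body + "\n")
```

A call such as `fix_vcf("calls.vcf", "ref.fasta.fai", "calls.fixed.vcf")` (hypothetical file names) would write a patched copy. No assembly attribute is added, in line with the guess above that the length alone may be sufficient. The patched file would presumably need to be re-indexed with IndexFeatureFile before use.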

  • danilovkiri, Moscow, Russia (Member)

    The same error occurs when running the GATK 4.0.7.0 BaseRecalibrator with dbSNP VCF file processed (in any way) by bcftools. Bcftools adds contig ID fields to the VCF header without the corresponding lengths which causes the error. Deleting these lines solves the problem. Then what are all these errors about?
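
The alternative workaround mentioned above, deleting the ##contig lines that have no length, could be sketched as follows. This is an illustration only: the helper names are hypothetical, it operates on header lines already read into memory (without trailing newlines), and it has not been checked against every header variant that bcftools might write.

```python
import re

# Captures the ID attribute of a ##contig header line.
CONTIG_ID = re.compile(r"^##contig=<(?:[^>]*,)?ID=([^,>]+)")
# Detects an existing length attribute.
HAS_LENGTH = re.compile(r"[<,]length=\d+")


def is_lengthless_contig(line):
    """True for a ##contig header line that has no length attribute."""
    return line.startswith("##contig=<") and not HAS_LENGTH.search(line)


def contigs_without_length(lines):
    """List the IDs of ##contig lines that would trigger the error."""
    ids = []
    for line in lines:
        if is_lengthless_contig(line):
            match = CONTIG_ID.match(line)
            ids.append(match.group(1) if match else line)
    return ids


def drop_lengthless_contigs(lines):
    """Return the lines with length-less ##contig entries removed."""
    return [line for line in lines if not is_lengthless_contig(line)]
```

Running `contigs_without_length` first gives a quick way to confirm that the header is the cause before rewriting a large file.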

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @danilovkiri
    Hi,

    I think the VCF spec requires the length fields, or GATK needs to make sure the lengths of the contigs match in the VCF and reference FASTA.

    -Sheila

  • tedtoal (Member)

    The VCF spec doesn't say that the LENGTH attribute is mandatory. I think GATK tools should not require it. Furthermore, since GATK3 works without error on a compressed VCF with no LENGTH, it should not produce an error when run with an uncompressed VCF. Also, GATK4 should be producing an error message rather than an exception when this error occurs. Further, GATK4 should behave the same way as GATK3 in this regard. Checking for matching length between contigs and reference should only be done if the contigs include the LENGTH attribute.

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @tedtoal
    Hi,

    I agree. Can you submit a bug report so I can let the team know? Instructions are here.

    Thanks,
    Sheila

  • tedtoal (Member)

    Okay, I have uploaded a zip file with the bug report. It is named:

    twtoal/TribbleExceptionError.zip

  • RMulet, Barcelona (Member)

    What is the status of this issue? Are there any plans to change the behaviour of GATK in this respect? I am using GATK4 and, despite the reportedly increased laxity of this version, I encounter the same problem. Wouldn't it be possible to add an option to disable dictionary validation altogether? It can be really frustrating sometimes...

    In particular, I am running HaplotypeCaller with the --dbsnp option and a dbSNP file that has been processed with bcftools. As reported above, this adds the "contig" lines to the VCF header, but without the LENGTH that GATK apparently requires. I could either remove those lines or add LENGTH, but the file is quite big and it would take some time. In the end I just dropped the --dbsnp option because it's not essential to my analysis and I want to get the job done.

    It's not an insurmountable obstacle, but it took me a few minutes to figure out what was going on, and it would have taken even longer had I wanted the SNP ID in my variant calling file. GATK tools are great when it comes to performance, but personally I find them unnecessarily stringent sometimes.

  • shlee, Cambridge (Member, Broadie)

    Hi @RMulet,

    What you are asking for is already a feature. All you have to do is set --disable-sequence-dictionary-validation true.

    --disable-sequence-dictionary-validation,-disable-sequence-dictionary-validation:Boolean
                                  If specified, do not check the sequence dictionaries from our inputs for compatibility.
                                  Use at your own risk!  Default value: false. Possible values: {true, false} 
    

    If the tool behavior is unexpected, then please post the command and error message.
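
As an illustration of how the flag quoted above might be passed, here is a small sketch that assembles a HaplotypeCaller command line. The file names are hypothetical placeholders. The thread does not confirm whether the flag avoids the TribbleException in every tool, so treat this as something to try rather than a verified fix.

```python
import shlex


def haplotypecaller_command(reference, bam, output, dbsnp):
    """Build a GATK4 HaplotypeCaller argument list with sequence
    dictionary validation disabled (use at your own risk, per the help text)."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", output,
        "--dbsnp", dbsnp,
        "--disable-sequence-dictionary-validation", "true",
    ]


# Hypothetical input and output paths, for illustration only.
cmd = haplotypecaller_command(
    "ref.fasta", "sample.bam", "sample.vcf.gz", "dbsnp.vcf.gz"
)
print(shlex.join(cmd))  # the command as it would be typed in a shell
```

The list form can be handed directly to `subprocess.run(cmd, check=True)` if GATK is installed and on the PATH.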
