GenotypeGVCFs sees DP incorrectly in INFO, not FORMAT field

roelroel AmsterdamMember
edited May 2015 in Ask the GATK team

I have a potential bug running GATK GenotypeGVCFs. It complains that there is a DP in the INFO field, but in my haplotypecaller-generated -mg.g.vcf.gz's I do not have a DP in the info, I do have DP in the FORMAT field though, but that's present in the headers as shown below the error output.

Any idea what could be the problem?

INFO  18:30:12,694 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22 
INFO  18:30:12,699 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz -V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz 
INFO  18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13. 
INFO  18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12 
INFO  18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT 
INFO  18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals 
WARN  18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine 
INFO  18:30:29,878 GenomeAnalysisEngine - Preparing for traversal 
INFO  18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal 
INFO  18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  18:30:29,965 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  18:30:29,966 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
INFO  18:31:00,420 ProgressMeter -       1:4845033         0.0    30.0 s      50.3 w        0.0%    46.7 h      46.7 h 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
    at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176)
    at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
    at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:221)
    at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182)
    at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:269)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:351)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:119)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
##### ERROR ------------------------------------------------------------------------------------------

for f in *.g.vcf.gz; do echo -e "\n-- $f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:21:0,0,432
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:20:0:20:0,0,410
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,773
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:25:2:25:0,3,790
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,837
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:23:31:23:0,31,765
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:21:39:0,39,488,60,491,512:20,0,0,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:18:0:18:0,0,514
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,810
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,812
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:25:0,0,624
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0.08    .       ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:17:20:20,0,311,62,320,382:14,0,3,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     6.76    .       ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:25:43:43,0,401,102,417,519:20,0,3,2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:23:1:0,1,395,56,406,460:19,0,0,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:28:0,0,626
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     5.99    .       ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:18:42:42,0,293,84,305,388:13,1,3,1
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:22:0,0,558
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       GA,<NON_REF>    6.76    .       ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00        GT:DP:GQ:PL:SB  0/1:19:43:43,0,262,87,274,361:12,3,4,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     16.82   .       ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:21:54:54,0,352,102,367,470:16,0,4,1
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:26:0:25:0,0,419
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:30:0:30:0,0,771
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:34:77:34:0,78,1136
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:26:0:20:0,0,397
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GAA     G,GA,<NON_REF>  22.75   .       ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00        GT:DP:GQ:PL:SB  0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0
Post edited by roel on

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @roel
    Hi,

    I have seen this error when one of the input gvcfs has the INFO DP defined in the header, but others do not. Can you check if any of your input gvcfs have INFO DP defined in the header? You can copy the header field from the gvcf that has it, and paste it into the other gvcf headers.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @roel
    Hi again,

    Which version of Haplotype Caller did you use to generate the GVCFs?

    Can you please post the entire header of one of the GVCF files, not just lines extracted using unix tools?

    Thanks,
    Sheila

  • roelroel AmsterdamMember

    Hi Sheila,

    Thanks for looking into this. I run the same pipeline for GRCh38 (human) and GRCm38 (mouse), in the latter it seems I observe this error for some samples. None of my input gvcfs have INFO DP defined in the header. The haplotypecaller is the same version: v3.3-0-geee94ec. All gvcfs were created using identical commands. only the ##GATKCommandLine and #CHROM lines differ across gvcfs.

    Hmm, I see now there was an error during the gvcf generation. It is attached, although the error may be unrelated. gvcfs seem completed, or almost, and the GenotypeGVCFs error I initially reported occurs early on chromosome 1.

    Roel

    ##fileformat=VCFv4.1
    ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
    ##FILTER=<ID=LowQual,Description="Low quality">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
    ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
    ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
    ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
    ##GATKCommandLine=<ID=HaplotypeCaller,Version=3.3-0-geee94ec,Date="Thu May 07 03:12:07 CEST 2015",Epoch=1430961127308,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[3342_9_14MCB172_CGCTGATC_tR1-mdup-ra-bq.bam] showFullBamList=false read_buffer_size=null phone_home=NO_ET gatk_key=null tag=NA read_filter=[FailsVendorQualityCheck, MappingQualityUnavailable, UnmappedRead, BadCigar] intervals=[/net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SureSelect/ex100/S0276129_Regions-mm10-ex100.bed] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=NONE downsample_to_fraction=null downsample_to_coverage=null baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 logging_level=INFO log_to_file=null help=false version=false likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthPerSampleHC, StrandBiasBySample] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator, ChromosomeCounts, FisherStrand, StrandOddsRatio, QualByDepth] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GVCF annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 max_alternate_alleles=20 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=3 gcpHMM=10 includeUmappedReads=false useAllelesTrigger=false phredScaledGlobalReadMismappingRate=45 maxNumHaplotypesInPopulation=128 mergeVariantsViaLD=false doNotRunPhysicalPhasing=false pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrectKmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=250 minReadsPerAlignmentStart=5 activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
    ##GVCFBlock=minGQ=0(inclusive),maxGQ=1(exclusive)
    ##GVCFBlock=minGQ=1(inclusive),maxGQ=2(exclusive)
    ##GVCFBlock=minGQ=10(inclusive),maxGQ=11(exclusive)
    ##GVCFBlock=minGQ=11(inclusive),maxGQ=12(exclusive)
    ##GVCFBlock=minGQ=12(inclusive),maxGQ=13(exclusive)
    ##GVCFBlock=minGQ=13(inclusive),maxGQ=14(exclusive)
    ##GVCFBlock=minGQ=14(inclusive),maxGQ=15(exclusive)
    ##GVCFBlock=minGQ=15(inclusive),maxGQ=16(exclusive)
    ##GVCFBlock=minGQ=16(inclusive),maxGQ=17(exclusive)
    ##GVCFBlock=minGQ=17(inclusive),maxGQ=18(exclusive)
    ##GVCFBlock=minGQ=18(inclusive),maxGQ=19(exclusive)
    ##GVCFBlock=minGQ=19(inclusive),maxGQ=20(exclusive)
    ##GVCFBlock=minGQ=2(inclusive),maxGQ=3(exclusive)
    ##GVCFBlock=minGQ=20(inclusive),maxGQ=21(exclusive)
    ##GVCFBlock=minGQ=21(inclusive),maxGQ=22(exclusive)
    ##GVCFBlock=minGQ=22(inclusive),maxGQ=23(exclusive)
    ##GVCFBlock=minGQ=23(inclusive),maxGQ=24(exclusive)
    ##GVCFBlock=minGQ=24(inclusive),maxGQ=25(exclusive)
    ##GVCFBlock=minGQ=25(inclusive),maxGQ=26(exclusive)
    ##GVCFBlock=minGQ=26(inclusive),maxGQ=27(exclusive)
    ##GVCFBlock=minGQ=27(inclusive),maxGQ=28(exclusive)
    ##GVCFBlock=minGQ=28(inclusive),maxGQ=29(exclusive)
    ##GVCFBlock=minGQ=29(inclusive),maxGQ=30(exclusive)
    ##GVCFBlock=minGQ=3(inclusive),maxGQ=4(exclusive)
    ##GVCFBlock=minGQ=30(inclusive),maxGQ=31(exclusive)
    ##GVCFBlock=minGQ=31(inclusive),maxGQ=32(exclusive)
    ##GVCFBlock=minGQ=32(inclusive),maxGQ=33(exclusive)
    ##GVCFBlock=minGQ=33(inclusive),maxGQ=34(exclusive)
    ##GVCFBlock=minGQ=34(inclusive),maxGQ=35(exclusive)
    ##GVCFBlock=minGQ=35(inclusive),maxGQ=36(exclusive)
    ##GVCFBlock=minGQ=36(inclusive),maxGQ=37(exclusive)
    ##GVCFBlock=minGQ=37(inclusive),maxGQ=38(exclusive)
    ##GVCFBlock=minGQ=38(inclusive),maxGQ=39(exclusive)
    ##GVCFBlock=minGQ=39(inclusive),maxGQ=40(exclusive)
    ##GVCFBlock=minGQ=4(inclusive),maxGQ=5(exclusive)
    ##GVCFBlock=minGQ=40(inclusive),maxGQ=41(exclusive)
    ##GVCFBlock=minGQ=41(inclusive),maxGQ=42(exclusive)
    ##GVCFBlock=minGQ=42(inclusive),maxGQ=43(exclusive)
    ##GVCFBlock=minGQ=43(inclusive),maxGQ=44(exclusive)
    ##GVCFBlock=minGQ=44(inclusive),maxGQ=45(exclusive)
    ##GVCFBlock=minGQ=45(inclusive),maxGQ=46(exclusive)
    ##GVCFBlock=minGQ=46(inclusive),maxGQ=47(exclusive)
    ##GVCFBlock=minGQ=47(inclusive),maxGQ=48(exclusive)
    ##GVCFBlock=minGQ=48(inclusive),maxGQ=49(exclusive)
    ##GVCFBlock=minGQ=49(inclusive),maxGQ=50(exclusive)
    ##GVCFBlock=minGQ=5(inclusive),maxGQ=6(exclusive)
    ##GVCFBlock=minGQ=50(inclusive),maxGQ=51(exclusive)
    ##GVCFBlock=minGQ=51(inclusive),maxGQ=52(exclusive)
    ##GVCFBlock=minGQ=52(inclusive),maxGQ=53(exclusive)
    ##GVCFBlock=minGQ=53(inclusive),maxGQ=54(exclusive)
    ##GVCFBlock=minGQ=54(inclusive),maxGQ=55(exclusive)
    ##GVCFBlock=minGQ=55(inclusive),maxGQ=56(exclusive)
    ##GVCFBlock=minGQ=56(inclusive),maxGQ=57(exclusive)
    ##GVCFBlock=minGQ=57(inclusive),maxGQ=58(exclusive)
    ##GVCFBlock=minGQ=58(inclusive),maxGQ=59(exclusive)
    ##GVCFBlock=minGQ=59(inclusive),maxGQ=60(exclusive)
    ##GVCFBlock=minGQ=6(inclusive),maxGQ=7(exclusive)
    ##GVCFBlock=minGQ=60(inclusive),maxGQ=70(exclusive)
    ##GVCFBlock=minGQ=7(inclusive),maxGQ=8(exclusive)
    ##GVCFBlock=minGQ=70(inclusive),maxGQ=80(exclusive)
    ##GVCFBlock=minGQ=8(inclusive),maxGQ=9(exclusive)
    ##GVCFBlock=minGQ=80(inclusive),maxGQ=90(exclusive)
    ##GVCFBlock=minGQ=9(inclusive),maxGQ=10(exclusive)
    ##GVCFBlock=minGQ=90(inclusive),maxGQ=99(exclusive)
    ##GVCFBlock=minGQ=99(inclusive),maxGQ=2147483647(exclusive)
    ##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
    ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
    ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
    ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
    ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
    ##contig=<ID=1,length=195471971>
    ##contig=<ID=10,length=130694993>
    ##contig=<ID=11,length=122082543>
    ##contig=<ID=12,length=120129022>
    ##contig=<ID=13,length=120421639>
    ##contig=<ID=14,length=124902244>
    ##contig=<ID=15,length=104043685>
    ##contig=<ID=16,length=98207768>
    ##contig=<ID=17,length=94987271>
    ##contig=<ID=18,length=90702639>
    ##contig=<ID=19,length=61431566>
    ##contig=<ID=2,length=182113224>
    ##contig=<ID=3,length=160039680>
    ##contig=<ID=4,length=156508116>
    ##contig=<ID=5,length=151834684>
    ##contig=<ID=6,length=149736546>
    ##contig=<ID=7,length=145441459>
    ##contig=<ID=8,length=129401213>
    ##contig=<ID=9,length=124595110>
    ##contig=<ID=MT,length=16299>
    ##contig=<ID=X,length=171031299>
    ##contig=<ID=Y,length=91744698>
    ##contig=<ID=JH584299.1,length=953012>
    ##contig=<ID=GL456233.1,length=336933>
    ##contig=<ID=JH584301.1,length=259875>
    ##contig=<ID=GL456211.1,length=241735>
    ##contig=<ID=GL456350.1,length=227966>
    ##contig=<ID=JH584293.1,length=207968>
    ##contig=<ID=GL456221.1,length=206961>
    ##contig=<ID=JH584297.1,length=205776>
    ##contig=<ID=JH584296.1,length=199368>
    ##contig=<ID=GL456354.1,length=195993>
    ##contig=<ID=JH584294.1,length=191905>
    ##contig=<ID=JH584298.1,length=184189>
    ##contig=<ID=JH584300.1,length=182347>
    ##contig=<ID=GL456219.1,length=175968>
    ##contig=<ID=GL456210.1,length=169725>
    ##contig=<ID=JH584303.1,length=158099>
    ##contig=<ID=JH584302.1,length=155838>
    ##contig=<ID=GL456212.1,length=153618>
    ##contig=<ID=JH584304.1,length=114452>
    ##contig=<ID=GL456379.1,length=72385>
    ##contig=<ID=GL456216.1,length=66673>
    ##contig=<ID=GL456393.1,length=55711>
    ##contig=<ID=GL456366.1,length=47073>
    ##contig=<ID=GL456367.1,length=42057>
    ##contig=<ID=GL456239.1,length=40056>
    ##contig=<ID=GL456213.1,length=39340>
    ##contig=<ID=GL456383.1,length=38659>
    ##contig=<ID=GL456385.1,length=35240>
    ##contig=<ID=GL456360.1,length=31704>
    ##contig=<ID=GL456378.1,length=31602>
    ##contig=<ID=GL456389.1,length=28772>
    ##contig=<ID=GL456372.1,length=28664>
    ##contig=<ID=GL456370.1,length=26764>
    ##contig=<ID=GL456381.1,length=25871>
    ##contig=<ID=GL456387.1,length=24685>
    ##contig=<ID=GL456390.1,length=24668>
    ##contig=<ID=GL456394.1,length=24323>
    ##contig=<ID=GL456392.1,length=23629>
    ##contig=<ID=GL456382.1,length=23158>
    ##contig=<ID=GL456359.1,length=22974>
    ##contig=<ID=GL456396.1,length=21240>
    ##contig=<ID=GL456368.1,length=20208>
    ##contig=<ID=JH584292.1,length=14945>
    ##contig=<ID=JH584295.1,length=1976>
    ##reference=file:///net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  3342_9_14MCB172_CGCTGATC_tR1.fq.gz
    
  • roelroel AmsterdamMember

    It seems the GenotypeGVCFs error is caused by a region in the bed file that is too large:

    MT 16018 16368

    Needless to say, GATK usability would improve significantly if it could cope with small issues like this, or at least report them early and clearly. I'll fix it and rerun to see whether it helps.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    That's not a very large region; should be a breeze to handle. There must be something else going on in there. Let us know what happens when you rerun.

  • roelroel AmsterdamMember
    edited May 2015

    A rerun with the corrected MT region resolves the GenotypeGVCFs error, but as you predicted, the issue in GenotypeGVCFs persists. I've also tried, without avail, running it without the --intervals <targets>.bed, with a single -V <sample>.g.vcf.gz, or with the -V:<sample>,VCF <sample>.g.vcf.gz
    points of interest may be that:

    • If I add a would-be valid '^##INFO=<ID=DP,...>' line in the .g.vcf.gz then GATK complains instead about a 'Key AD found in VariantContext field FORMAT' at the same site, which is even less correct, since this vcf does not have an AD in the INFO field, nor in the FORMAT.
    • Without the --intervals line, or with a different single sample I get the same error but at a different position.
    • with different single samples I also get the error at different positions, and it seems always with an ALT in the form of
      [ACTG],<NON_REF>, in one instance there are even two ALTs: 1 5019252 . CGGGG C,CGG,<NON_REF>
      Although for all samples the errror occurs early, it does not seem to be the first instance where the alternative has a comma.

    See the attachment. I've grepped the error site with -C 5, so the error mentioned in the GATK output is always in the middle. sample woi is without intervals.

    Thanks for looking into this, if it is not at my end, I hope this helps resolve the issue.
    Roel

  • roelroel AmsterdamMember

    I'm now running the GenotypeGVCFs step with an earlier version, GenomeAnalysisTK-3.1-11-g34477bd/GenomeAnalysisTK.jar, and although I had to remove the --max_alternate_alleles 20 and -ploidy 2 options, it seems to go fine now.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @roel It sounds like you're getting inconsistent file outputs -- I wonder if it has anything to do with the fact that you're sending the output to stdout instead of letting GATK write to file. Are you trying to pipe the output to some other program? Because that's not something we currently support.

  • roelroel AmsterdamMember
    edited May 2015

    @Geraldine, this is true. In the haplotypecaller step and in GenotypeGVCFs I send the output to /dev/stdout, which enables me to bgzip from stdin. But after the haplotypecaller I first egrep -v "^(Using|Total|Time) " to remove the messages printed to stdout that do not belong in the vcf. bgzip and tabix did not complain after this filter, so I do not suspect the issue is caused by this. Or does GenotypeGVCFs or the haplotypecaller fseek on its output files?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I'm not sure but you can test that by running on that interval and letting the output got to file. That will tell us at least if that's where the problem lies or not.

  • roelroel AmsterdamMember

    Sorry for the delay. I still have the same issue in gatk 3.4-1-gab2121b (prior version tested was gatk-3.3) . I have run it now without the parsing of stdout. GATK 3.1 still works as expected.
    As mentioned I only get this in mouse. leaving out the --dbsnp or running it with or without --intervals does not matter.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Looking back at your original post it looks like the INFO field of your GVCFs only contains the END tag and nothing else. That's not normal. There should be other site level annotations there. Are all records like this or just that site?

  • roelroel AmsterdamMember

    Most entries only contain the END=... tag in the INFO, but some contain other entries, MLEAC and MLEAF. This is in the test without dbsnp, where it chokes on 1:5117372. Could it be related that this site in the first file only has the END tag whereas in the second there's the MLEAC and MLEAF?

    zgrep -P -C5 "^1\t5117372\t"  3460_24_K26_ACACAGAA_L002_tjmrb_SPE.g.vcf.gz
    1       5101139 .       C       <NON_REF>       .       .       END=5101145     GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,27
    1       5101146 .       A       <NON_REF>       .       .       END=5101146     GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,25
    1       5101147 .       C       <NON_REF>       .       .       END=5101255     GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    1       5117248 .       A       <NON_REF>       .       .       END=5117318     GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    1       5117319 .       A       <NON_REF>       .       .       END=5117371     GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,27
    1       5117372 .       T       <NON_REF>       .       .       END=5117372     GT:DP:GQ:MIN_DP:PL      0/0:1:0:1:0,0,0
    1       5117373 .       A       <NON_REF>       .       .       END=5117408     GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,31
    1       5117409 .       G       <NON_REF>       .       .       END=5117482     GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    1       5117483 .       A       <NON_REF>       .       .       END=5117512     GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,28
    1       5117513 .       T       <NON_REF>       .       .       END=5117513     GT:DP:GQ:MIN_DP:PL      0/0:1:0:1:0,0,0
    1       5117514 .       A       <NON_REF>       .       .       END=5117568     GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,29
    

    and:

    zgrep -P -C5 "^1\t5117372\t"  3460_23_K25_AAGGTACA_L002_tjmrb_SPE.g.vcf.gz
    1       5117308 .       A       <NON_REF>       .       .       END=5117313     GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,47
    1       5117314 .       C       <NON_REF>       .       .       END=5117317     GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,53
    1       5117318 .       T       <NON_REF>       .       .       END=5117318     GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,52
    1       5117319 .       A       <NON_REF>       .       .       END=5117341     GT:DP:GQ:MIN_DP:PL      0/0:2:6:2:0,6,55
    1       5117342 .       T       <NON_REF>       .       .       END=5117371     GT:DP:GQ:MIN_DP:PL      0/0:3:9:3:0,9,82
    1       5117372 .       T       C,<NON_REF>     69.18   .       MLEAC=2,0;MLEAF=1.00,0.00       GT:DP:GQ:PL:SB  1/1:3:9:96,9,0,96,9,96:0,0,1,2
    1       5117373 .       A       <NON_REF>       .       .       END=5117373     GT:DP:GQ:MIN_DP:PL      0/0:3:9:3:0,9,90
    1       5117374 .       C       <NON_REF>       .       .       END=5117379     GT:DP:GQ:MIN_DP:PL      0/0:4:12:4:0,12,116
    1       5117380 .       T       <NON_REF>       .       .       END=5117382     GT:DP:GQ:MIN_DP:PL      0/0:5:0:5:0,0,92
    1       5117383 .       T       <NON_REF>       .       .       END=5117383     GT:DP:GQ:MIN_DP:PL      0/0:5:15:5:0,15,149
    1       5117384 .       G       <NON_REF>       .       .       END=5117389     GT:DP:GQ:MIN_DP:PL      0/0:5:0:5:0,0,86
    
  • roelroel AmsterdamMember
    edited July 2015

    Comparing the output with a human haplotype called sample, the END tag is there similarly. Conversely, it seems that here I do get an AD tag and others in the variants that also have the MLEAC and MLEAF entries. I suspect that what's going wrong already occurs during haplotype calling, and somehow the should be present AD tag, and others are not added?

    1       3729544 .       C       <NON_REF>       .       .       END=3729545     GT:DP:GQ:MIN_DP:PL      0/0:26:57:26:0,57,855
    1       3729546 .       T       <NON_REF>       .       .       END=3729553     GT:DP:GQ:MIN_DP:PL      0/0:23:60:21:0,60,900
    1       3729554 .       C       <NON_REF>       .       .       END=3729562     GT:DP:GQ:MIN_DP:PL      0/0:20:54:20:0,54,810
    1       3729563 .       G       <NON_REF>       .       .       END=3729563     GT:DP:GQ:MIN_DP:PL      0/0:20:51:20:0,51,765
    1       3729564 .       C       <NON_REF>       .       .       END=3729568     GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
    1       3729569 .       G       <NON_REF>       .       .       END=3729570     GT:DP:GQ:MIN_DP:PL      0/0:19:51:19:0,51,765
    1       3729571 .       G       <NON_REF>       .       .       END=3729572     GT:DP:GQ:MIN_DP:PL      0/0:19:48:19:0,48,720
    1       3729573 .       C       T,<NON_REF>     715.77  .       DP=18;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0  GT:AD:DP:GQ:PL:SB       1/1:0,18,0:18:54:744,54,0,744,54,744:0,0,0,18
    1       3729574 .       C       <NON_REF>       .       .       END=3729575     GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
    1       3729576 .       G       <NON_REF>       .       .       END=3729577     GT:DP:GQ:MIN_DP:PL      0/0:19:45:18:0,45,675
    1       3729578 .       G       <NON_REF>       .       .       END=3729579     GT:DP:GQ:MIN_DP:PL      0/0:19:48:19:0,48,720
    1       3729580 .       A       <NON_REF>       .       .       END=3729583     GT:DP:GQ:MIN_DP:PL      0/0:20:51:19:0,51,765
    1       3729584 .       C       <NON_REF>       .       .       END=3729587     GT:DP:GQ:MIN_DP:PL      0/0:19:48:17:0,48,720
    1       3729588 .       C       <NON_REF>       .       .       END=3729592     GT:DP:GQ:MIN_DP:PL      0/0:17:39:17:0,39,585
    1       3729593 .       C       <NON_REF>       .       .       END=3729593     GT:DP:GQ:MIN_DP:PL      0/0:17:36:17:0,36,540
    1       3729594 .       A       <NON_REF>       .       .       END=3729594     GT:DP:GQ:MIN_DP:PL      0/0:16:33:16:0,33,495
    

    This .g.vcf.gz was created at an earlier time, with gatk 3.3 with the aforementioned pipe filter, but without apparent issues. See attached for one log at that time, zipped because it is quite long.

    EDIT:
    If I look at the differences in haplotypecaller commands (original on 3.3 with the pipe filter), the only differences besides ref, intervals (due to species) and bam files, obviously, is that the mouse haplocall also has '--group none' and '-l INFO' whereas the human haplocall had `-l DEBUG' and lacked the group

    Also for mouse there are these lines:
    libVectorLoglessPairHMM unpacked successfully
    Using vectorized implementation of PairHMM

    Whereas in the human I got:
    libVectorLoglessPairHMM not found in JVM library path - trying to unpack
    libVectorLoglessPairHMM unpacked successfully from GATK jar file

    Post edited by roel on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh never mind my earlier comment, that's expected for reference blocks.

    What's weird is the inconsistency in the annotations in variant records between the different files, eg one has

    1       5117372 .       T       C,<NON_REF>     69.18   .       MLEAC=2,0;MLEAF=1.00,0.00       GT:DP:GQ:PL:SB  1/1:3:9:96,9,0,96,9,96:0,0,1,2
    

    And the other has

    1       3729573 .       C       T,<NON_REF>     715.77  .       DP=18;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0  GT:AD:DP:GQ:PL:SB       1/1:0,18,0:18:54:744,54,0,744,54,744:0,0,0,18
    

    Regardless of the fact that they are at different sites, they should have the same annotations, which leads me to believe something went wrong in at least one of the cases (probably the first one). Files generated with the same version and the same commands should have the same sort of content.

    My recommendation is to generate the affected GVCF files from scratch. Make sure to use the same version for all the GVCFs.

  • roelroel AmsterdamMember
    edited July 2015

    What's weird is the inconsistency in the annotations in variant records

    Exactly, but note that these are two separate experiments and species. One is analysing human exome, the other is mouse exome, Mouse reproducibly fails. Haplotypecalling was done with the same settings per experiment, so your recommendation is not relevant.

    Does haplotypecalling on mouse exome work for you?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    The program should work exactly the same way and output the exact same annotations when run on human or on mouse; it makes no difference at all.

    It's still odd that you're getting records with only MLEAC and MLEAF annotated. That to me is the biggest red flag. What happens if you redo the calling on just that region with the latest version?

  • roelroel AmsterdamMember

    Since you're avoiding my question, I'm guessing this wasn't ever tested. This is a regression. I'll leave you with a test case and if you have a real fix, then I am willing to try again, but until that time I'm not going to spend any more time on it. I have other things to do, and people are waiting on their data.

    These were the commands I run. You'll have to adapt the paths if you want to try. The bam files tested here with first 20K reads are attached. bam files are straight from bwa mem. Fastqs were adapterclipped with cutadapt, but otherwise nothing special.

    /opt/java/jdk1.7.0_79/bin/java -Xmx8G -XX:ParallelGCThreads=4 -jar /net/NGSanalysis/apps/gatkbuild/gatk-protected/protected/gatk-package-distribution/target/gatk-package-distribution-3.4.jar -T HaplotypeCaller -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -I 3460_23_K25_AAGGTACA_L002_tj-20k.bam --downsampling_type NONE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -ploidy 2 -minPruning 3 -maxReadsInRegionPerSample 250 -stand_call_conf 30 -stand_emit_conf 10 --read_filter FailsVendorQualityCheck --read_filter MappingQualityUnavailable --read_filter UnmappedRead --read_filter BadCigar -o 3460_23_K25_AAGGTACA_L002_tj-20k.g.vcf.gz -nct 1 --group none --max_alternate_alleles 20 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SureSelect/ex100/S0276129_Regions-mm10-ex100.bed 2> 3460_23_K25_AAGGTACA_L002_tj-20k.g.vcf.gz.log
    
     /opt/java/jdk1.7.0_79/bin/java -Xmx8G -XX:ParallelGCThreads=4 -jar /net/NGSanalysis/apps/gatkbuild/gatk-protected/protected/gatk-package-distribution/target/gatk-package-distribution-3.4.jar -T HaplotypeCaller -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -I 3460_24_K26_ACACAGAA_L002_tj-20k.bam --downsampling_type NONE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -ploidy 2 -minPruning 3 -maxReadsInRegionPerSample 250 -stand_call_conf 30 -stand_emit_conf 10 --read_filter FailsVendorQualityCheck --read_filter MappingQualityUnavailable --read_filter UnmappedRead --read_filter BadCigar -o 3460_24_K26_ACACAGAA_L002_tj-20k.g.vcf.gz -nct 1 --group none --max_alternate_alleles 20 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SureSelect/ex100/S0276129_Regions-mm10-ex100.bed 2> 3460_24_K26_ACACAGAA_L002_tj-20k.g.vcf.gz.log
    
     /opt/java/jdk1.7.0_79/bin/java -Xmx32G -XX:ParallelGCThreads=8 -jar /net/NGSanalysis/apps/gatkbuild/gatk-protected/protected/gatk-package-distribution/target/gatk-package-distribution-3.4.jar -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o test_gatk3_4_haplotype-stage-j.vcf -ploidy 2 --max_alternate_alleles 20 --dbsnp /net/NGSanalysis/ref/Mus_musculus.GRCm38/snp/mgp.v5.merged.snps_all.dbSNP142.vcf.gz --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SureSelect/ex100/S0276129_Regions-mm10-ex100.bed -V:3460_23_K25_AAGGTACA_L002_tj-20k,VCF 3460_23_K25_AAGGTACA_L002_tj-20k.g.vcf.gz -V:3460_24_K26_ACACAGAA_L002_tj-20k,VCF 3460_24_K26_ACACAGAA_L002_tj-20k.g.vcf.gz 2> test_gatk3_4_haplotype-stage-j.vcf.log
    

    and at this point you'll notice within 15 minutes it fails.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @roel
    Hi,

    Can you upload the reference you used? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • roelroel AmsterdamMember
    edited September 2015

    Hi Sheila,

    Thank you for looking into this. With regression I meant that it used to work before, 3.1.11, but it does not now. The reference can be found here:
    http://ftp.ensembl.org/pub/release-76/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
    I have a dict file attached in case you want to verify. The intervals (bed) file is not strictly necessary to trigger the error.
    so you should also see it without this:

    --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SureSelect/ex100/S0276129_Regions-mm10-ex100.bed
    

    let me know if you need anything else.
    Roel

  • roelroel AmsterdamMember

    (uploading the raw dict wasn't allowed)

  • roelroel AmsterdamMember

    Thanks for the confirmation. What the reason was our 3.1.11 build faltered may never be clear, we just decided to use the later version.

Sign In or Register to comment.