We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Bug - wrong genotype called by HaplotypeCaller

Hi,
HaplotypeCaller in gatk v4.0.10.1 is giving markedly wrong genotypes in the GVCF file. See below for an example, where a locus with 17 reference reads and 33 ALT reads is being called as homozygous ALT (1/1).

This is also seen in many other loci. This must be a bug and needs to be investigated, because this would affect every GATK variant calling pipeline currently being used. Let me know what info you need and I can send it.

chr1 1482055 . G A, 1095.03 . AS_RAW_BaseQRankSum=|0.7,1|NaN;AS_RAW_MQ=28849.00|78195.00|0.00;AS_RAW_MQRankSum=|2.1,1|NaN;AS_RAW_ReadPosRankSum=|-1.4,1|NaN;AS_SB_TABLE=16,1|32,1|0,0;BaseQRankSum=0.719;DP=50;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQRankSum=2.107;RAW_MQandDP=107044,50;ReadPosRankSum=-1.342 GT:AD:DP:GQ:PL:SB 1/1:17,33,0:50:99:1109,99,0,1109,99,1109:16,1,32,1

Answers

  • GERGER Member

    Hi, Here is the BAM file with reads from the region. And here is the HaplotypeCaller command:

    gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
    HaplotypeCaller \
    -R ~{ref_fasta} \
    -I ~{input_bam} \
    -L ~{interval_list} \
    -O ~{output_file_name} \
    -contamination ~{default=0 contamination} \
    -G StandardAnnotation -G StandardHCAnnotation ~{true="-G AS_StandardAnnotation" false="" make_gvcf} \
    -new-qual \
    -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
    ~{true="-ERC GVCF" false="" make_gvcf} \
    ~{bamout_arg}

  • GERGER Member
    edited November 2019

    Also, in the raw BAM file (after removing duplicate reads, secondary alignments, etc), there are 42 REF reads and 43 ALT reads. So it is not clear where HaplotypeCaller got 17 and 33 reads, respectively from. See attached.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @GER

    Please rerun this data with the latest version of GATK 4.1.4.0 and let me know if the issue persists.

  • GERGER Member

    Hi, I reran with GATK 4.1.40 and same problem (see output below). It is still calling "1/1" ALT homozygous with the highest possible GQ (99) for a variant that is clearly not homozygous, with 17 REF reads out of 50 total reads.

    Must be some fundamental bug in HaplotypeCaller, which is concerning.

    chr1 1482055 . G A, 1095.03 . AS_RAW_BaseQRankSum=|0.7,1|NaN;AS_RAW_MQ=28849.00|78195.00|0.00;AS_RAW_MQRankSum=|2.1,1|NaN;AS_RAW_ReadPosRankSum=|-1.4,1|NaN;AS_SB_TABLE=16,1|32,1|0,0;BaseQRankSum=0.719;DP=50;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQRankSum=2.107;RAW_MQandDP=107044,50;ReadPosRankSum=-1.342 GT:AD:DP:GQ:PL:SB 1/1:17,33,0:50:99:1109,99,0,1109,99,1109:16,1,32,1

  • GERGER Member

    Hi, Just to update, the bug is more concerning than I previously thought. I reran this sample with joint genotyping (HaplotypeCaller) as a trio with the individual's father and mother, and it still outputs the wrong genotype. Albeit with a lower JP than JL, meaning it took some of the trio into account and downgraded the quality.

    But simple common sense says that something must be wrong. A sample that has 16 REF alleles and 30 ALT alleles should not be called as homozygous ALT, doesn't matter what qualities or other weird stuff is going on in the algorithm. And it is clearly heterozygous looking at the raw data (see above screenshot) and based on the parents.

    Any ideas what is the bug/issue with HaplotypeCaller? You should have all the needed raw data attached above to investigate. Thanks.

    chr1 1482055 . G A 2507.91 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=1.294;DP=225;ExcessHet=1.5490;FS=3.883;MLEAC=3;MLEAF=0.500;MQ=50.20;MQRankSum=1.982;PG=47,21,0;QD=16.18;ReadPosRankSum=-0.963;SOR=1.573 GT:AD:DP:GQ:JL:JP:PL:PP 0/0:67,0:67:99:89:30:0,200,2198:0,114,2091 0/1:53,56:109:99:89:30:1523,0,1406:1609,0,1385 1/1:16,30:46:51:89:30:1001,90,0:988,51,0

  • GERGER Member

    Hi all,
    I found the problem. I was using HaplotypeCaller with the -contamination option per the GATK github workflow. However VerifyBamID that generates the contamination estimate was producing incorrect contamination estimates that were very high, between 0.3-0.5. This was causing significant discounting of the evidence from ALT reads and producing wrong genotypes. Simply changing contamination to 0 (i.e. default value) fixed the issue. See new genotype below.

    But note - it is unclear why VerifyBamID is producing wildly incorrect contamination estimates. I know they are incorrect because we are actually using unique dual indexing for which contamination should be far better than any standard library prep. What is different about these libraries relative to our prior libraries is that they were sequenced on a Novaseq.

    This makes me suspect that some kind of underlying artifact or difference in Novaseq data is causing VerifyBamID to give very incorrect contamination estimates. This in turn would cause HaplotypeCaller to give very wrong genotypes.

    I think the Broad GATK pipeline team needs to carefully evaluate VerifyBamID on the data from the newer sequencing instruments. Otherwise, many people's data may be significantly affected. This is a potentially significant issue.

    chr1 1482055 . G A, 953.60 . AS_RAW_BaseQRankSum=|0.7,1|NaN;AS_RAW_MQ=28849.00|78195.00|0.00;AS_RAW_MQRankSum=|2.1,1|NaN;AS_RAW_ReadPosRankSum=|-1.4,1|NaN;AS_SB_TABLE=16,1|32,1|0,0;BaseQRankSum=0.719;DP=50;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=2.107;RAW_MQandDP=107044,50;ReadPosRankSum=-1.342 GT:AD:DP:GQ:PL:SB 0/1:17,33,0:50:99:961,0,361,1010,460,1470:16,1,32,1

Sign In or Register to comment.