Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

False positive(?) variant calls of long insertion GATK 3.6

MaxMax Member
edited February 21 in Ask the GATK team

Hi!

first of all, I'm aware that the calling of long Indels is usually problematic. But since I just came across the issue and could not find any possible explanation for this, I wanted to ask if maybe some of you guys know anything about this observation:

chr3 195507858 . G T,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT 157204 VQSRTrancheINDEL99.00to99.90 AC=0,2;AF=0.237,0.053;AN=2;BaseQRankSum=8.08;ClippingRankSum=0;DP=44310;ExcessHet=34.8579;FS=2.569;InbreedingCoeff=-0.153;MLEAC=63,13;MLEAF=0.24,0.05;MQ=33.48;MQRankSum=-0.033;NEGATIVE_TRAIN_SITE;QD=19.7;ReadPosRankSum=-0.672;SOR=0.898;VQSLOD=-2.182;culprit=MQRankSum GT:AD:DP:GQ:PGT:PID:PL 2/2:60,0,0:60:99:.:.:7071,890,2640,496,194,0

As you can see, its a homozygous genotype 2/2. However, when looking at the AD fields, I cannot detect any evidence for the variant. In contrast, all reads seem to carry the reference allele. Based on this, its completely unclear to me how the caller concludes on this genotype.

In another sample it looks like this:
chr3 195507858 . G GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT144784 VQSRTrancheINDEL99.00to99.90 AC=1;AF=0.042;AN=2;BaseQRankSum=3;ClippingRankSum=0;DP=44395;ExcessHet=39.0155;FS=2.58;InbreedingCoeff=-0.1729;MLEAC=10;MLEAF=0.038;MQ=33.27;MQRankSum=-0.746;NEGATIVE_TRAIN_SITE;QD=17.95;ReadPosRankSum=-0.672;SOR=0.897;VQSLOD=-2.312;culprit=MQRankSum GT:AD:DP:GQ:PGT:PID:PL 0/1:89,0:92:99:.:.:4731,0,2164

Same issue, but now its suddenly heterozygous...

And in this sample:
chr3 195507858 . G GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT 156872 PASS AC=2;AF=0.061;AN=2;BaseQRankSum=0.967;ClippingRankSum=0;DP=42059;ExcessHet=33.8845;FS=3.403;InbreedingCoeff=-0.1488;MLEAC=14;MLEAF=0.053;MQ=34.6;MQRankSum=-1.012;QD=19.08;ReadPosRankSum=-0.692;SOR=0.827;VQSLOD=-0.3066;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL 1/1:39,27:66:99:.:.:6872,496,0

Although there is (in my opinion) evidence for a 0/1 genotype, its called as 1/1.

Would be great if someone could share his/her opinion on that. Maybe I'm just not up to date, but I think these specific genotypes really make no sense..

Thanks!

Max

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @Max

    Is it possible to try this with the latest version of gatk4.1 to see if the problem persists?

  • MaxMax Member
    edited February 25

    Hi @bhanuGandham

    actually I was wrong concerning our original GATK version, it was 3.8.0 , not 3.6.0.
    Anyway, I just took another sample out of our cohort which showed this variant and called it once more using 3.8 and 4.1
    Pipeline until this step was simply BWA+Picard DuplicateRemoval. I used the same BAM as Input for HC 3.8 and 4.1

    VCF Output for 3.8

    chr3    195507858       .       G       GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT       2190.73 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-1.730e-01;ClippingRankSum=0.00;DP=259;ExcessHet=3.0103;FS=1.874;MLEAC=2;MLEAF=1.00;MQ=53.87;MQRankSum=-5.400e-01;QD=12.24;ReadPosRankSum=-3.824e+00;SOR=1.336   GT:AD:DP:GQ:PL  1/1:175,4:180:99:2228,146,0
    

    VCF Output for 4.1

    chr3    195507858       .       G       GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT       2099.06 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-5.990e-01;DP=261;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=52.05;MQRankSum=-6.150e-01;QD=14.09;ReadPosRankSum=-1.262e+00;SOR=0.389        GT:AD:DP:GQ:PL  1/1:145,4:149:99:2113,151,0
    

    Although the AD values change a bit, the overall problem seems to be persistent. Its still called as homozygous 1/1 although it should be 0/0.

    So another thing I recognized was that this specific site has a lot of alternative alleles. Maybe its related to this?
    So in case this helps, here are the g.vcf entries that show the additional alternative alleles:

    GATK 3.8

    chr3    195507858   .   G   GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGGGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGCGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGCGTGGGGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCAGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGGGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCAGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGTCAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,<NON_REF> 2190.73 .   BaseQRankSum=-0.173;ClippingRankSum=0.000;DP=259;ExcessHet=3.0103;MLEAC=2,0,0,0,0,0,0,0;MLEAF=1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00;MQRankSum=-0.540;RAW_MQ=751569.00;ReadPosRankSum=-3.824   GT:AD:DP:GQ:PL:SB   1/1:175,4,0,0,1,0,0,0,0:180:99:2228,146,0,448,260,2108,549,374,2083,2887,547,362,2426,3090,4234,1145,304,697,858,873,2484,1396,533,2404,2434,2844,2736,4526,663,536,2328,2941,3295,1004,2678,4316,2676,800,2657,2867,3233,2798,4520,3117,6022:70,105,1,4
    

    GATK 4.1

    chr3    195507858   .   G   GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGCGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCAGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGGGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCAGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGGGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,GGTGGATACTGAGGAAGTGTCGGTGTCAGGAAGAGGGGTGGCGTGACCGGTGGATGCCGAGGAAGCGTCGGTGACAGGAAGAGGGGTGGTGTCACCTGTGGATACTGAGGAAAAGCTGGTGACAGGAAGAGGGGTGGCGTGACCT,<NON_REF>2099.06    .   BaseQRankSum=-0.599;DP=261;ExcessHet=3.0103;MLEAC=0,0,2,0,0,0,0;MLEAF=0.00,0.00,1.00,0.00,0.00,0.00,0.00;MQRankSum=-0.615;RAW_MQandDP=707028,261;ReadPosRankSum=-1.262  GT:AD:DP:GQ:PL:SB   3/3:145,0,0,4,0,0,0,0:149:99:2113,488,2575,1199,843,2982,151,328,382,0,1637,2721,3565,906,7815,436,1926,750,240,2682,1980,577,2627,969,449,3412,2162,3790,2643,2677,3250,844,5259,2524,2908,6215:59,86,2,2
    

    So then I also looked at the HC 4.1 BAM using IGV (created by using -bamout):
    Position 195507858

    Position 195507859

    So after looking at Pos. 195507859 it seems as if there is support for the actual Insertion (the G at 195507858 did not change ) starting there, while Pos. 195507858 supports the AD values of the vcf entry. So taken together, I guess the genotype is probably correct but simply the AD values do not reflect the reads that actually cover the variant..
    Could that be correct?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @Max

    I had the dev team look into this and this is what they have to say:
    AD represents the number of "clean" reads supporting the genotype - in the first example the user had pileup of 44310 reads, where only 60 clean reads supported to 0/0 genotype, but the other 40k reads still might have evidence for genotype 2/2, and that is represented in PL scores. For the last example user showed it looks a little strange - one helpful thing would be to look at the original BWA alignment instead of bamout to see what's going on

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    We haven't heard from the user in more than two business days. The user has been notified and this ticket is now closed.

  • MaxMax Member

    Hi @bhanuGandham ,

    sorry for the delay. In case you still want to look into this, here's the requested screenshot of the original BAM file that was used as input for the HC:

Sign In or Register to comment.