Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

GQ=0 but GT=0/0

GATK calls for SNP rs429358 don't look quite right in our project level vcf. This is one of the two SNPs that define the APOE e2,e3 and e4 (so just 3 haplotypes in region). About 1% of the 5000 samples have GQ=0 with a GT=0/0 and DP >30 . The nearby SNP rs7412 does not have this pattern (instead all GQ=99 for DP >30). . With a GQ=0, I was expecting a './.' instead. However,The 0/0 looks correct based on the IGV view of the bam in this region

Is this a bug or is something else going on?

Tagged:

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @jfarrell,

    Remember that GQ refers to genotype quality, which is trying to distinguish between ploidied genotypes, e.g. 0/0, 0/1, 1/1, etc for a diploid site. The confidence for a site being variant can be high even with a low GQ. For these samples, what is the QUAL score for these sites in the GVCF? Are you using the new qual model (-new-qual)? If you spot-check these 1% samples for these sites by visualizing the reads in IGV, do the calls makes sense?

    If the calls make sense, then it is up to you how you would like such GQ=0 sites to be annotated. You may find Tutorial#12350 of interest.

  • jfarrelljfarrell Member ✭✭

    I did view several bams with a GQ=0 and DP >40. They were clearly homozygous reference. 40+ reads with no variant. The GT='0/0' is accurate. This GT 0/0 is also consistent with the APOE genotypes we have. So the GQ calculation is the problem.

    So to further check, I ran freebayes on those bams with GQ=0 and high read depth. The GQ calculated by that program was in the 140-150 range (it does not round down to 99). So it appears that the GATK pipeline should be setting the GQ to 99 and not to 0 for these samples. I am sure there are plenty of examples of high read depth and poor GQ but this is not one of them.

    I don't have access to the gVCF. This was run with GATK 3.7 so I expect the experimental --new-qual parameter was not used. The pVCF shows the site passed.

    chr19 44908684 rs429358 T C 997935 PASS

    The GQ=0 for these samples are clearly incorrect. It appears to only be an issue for homozygous reference calls and not the heterozygous calls. The GQ should be 99 and not 0 for these high depth homozygous reference calls.

    If a project had a QC filter to set any GT with a GQ <20 to missing './.', the filter would incorrectly set many very high quality calls to './.' . Those set to missing would be excluded from association testing and reduce power.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @jfarrell,

    The GQ should be 99 and not 0 for these high depth homozygous reference calls.

    To summarize, although the calls are correctly hom-ref, the concern is that these sites would be inappropriately handled at the sample level based on their GQ0. These results were obtained with GATK3.7.

    Could you do us a favor? Could you see if using GATK v4.0.11.0 improves the GQ at these hom-ref sites? If it doesn't, we will want to investigate what exactly is going on and would ask if you could donate a small snippet of your data to recapitulate the results. Instructions for uploading data are at https://software.broadinstitute.org/gatk/guide/article?id=1894.

    I have heard from the developers that GATK4 recently incorporated features to help with calling hom-ref sites with confidence. If we peruse the release notes, we find these related changes:

    v4.0.10.1

    #4969

    v4.0.9.0

    #5172
    #5171

    As an aside, -new-qual was backported to GATK3 and you should be able to use it with GATK3.7/3.8. The parameter is -newQual or --useNewAFCalculator.

    Thank you.

  • jfarrelljfarrell Member ✭✭

    I tested both GATK 3.7 and Gatk 4.0.11.0 with and without the -newQual parameter on the samples with GQ=0 for the genotype on the samples.

    The results show the bug still exists when creating a g.vcf.gz file using either GATK versions with or without specifying the -newQual parameter. Here are some examples from the g.vcf.gz

    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:44:0:41:0,0,1097
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:47:0:44:0,0,1003
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:44:0:44:0,0,778
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:66:0:65:0,0,1056
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:39:0:38:0,0,893
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:55:0:55:0,0,1394
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:51:0:50:0,0,1294
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:46:0:45:0,0,863
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:48:0:47:0,0,1092
    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:53:0:51:0,0,1216
    

    Interestingly, when creating a vcf file (not a g.vcf), the GQ values looked accurate. GQ is often set to 99 for samples with DP > 40. No samples had a GQ=0. Here are a few examples of the vcf output.

                    GT:AD:DP:GQ:PL
                    0/0:37,0:37:99:0,111,1236
                    0/0:40,0:40:99:0,120,1357
                    0/0:34,0:34:99:0,102,1161
                    0/0:48,0:48:99:0,144,1648
                    0/0:33,0:33:99:0,99,1036
                    0/0:47,0:47:99:0,141,1693
                    0/0:42,0:42:99:0,126,1410
                    0/0:37,0:37:99:0,111,1215
                    0/0:39,0:39:99:0,117,1321
                    0/0:41,0:41:99:0,123,1385
                    0/0:53,0:53:99:0,159,1744
                    0/0:45,0:45:99:0,135,1529
    

    I will upload a file for replicating the issue later today.

  • jfarrelljfarrell Member ✭✭

    The test file gq0_bug.tar.gz has been uploaded.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @jfarrell,

    I've received and processed your test data.

    It appears that you are referring to GQ0 sites in GVCF outputs and not the final VCF callset. The site in question, chr19 44908684, is invariant in both the GVCF and in the final VCF. Your concern is with the GQ notation for the GVCF and you say that the final VCF (produced using straight-up HaplotypeCaller in default mode) resolves everything correctly. Please let me know if I've misunderstood anything.

    There is another tool that is required to genoytpe GVCFs and this tool is called GenotypeGVCFs. We are not very original with naming our tools.

    I think you may find Article#11004 helpful towards understanding the differences between GVCFs and VCFs.

    Let's interpret the GVCF together. To be clear, at this stage, there has been no genotyping and the GVCF mode of HaplotypeCaller is giving us the possibility of variance for the 5bp block starting at 44,908,684 and ending at 44,908,688. You show the GVCF blocks for ten different samples for this same interval. Here is the GVCF record for the top sample:

    chr19   44908684        .       T       <NON_REF>       .       .       END=44908688    GT:DP:GQ:MIN_DP:PL      0/0:44:0:41:0,0,1097
    

    Here, the first PL value of 0 refers to the likelihood of REF/REF, the second PL value of 0 refers to the likelihood of REF/NON_REF and the final PL value of 1097 refers to the likelihood of NON_REF/NON_REF. After GenotypeGVCFs, during which step we actually genotype, we see that this block resolves to a HOM_REF call by virtue of the lack of records for the region and in fact for the entirety of the test dataset you shared. I hope this is helpful.

  • jfarrelljfarrell Member ✭✭

    I know the GT=0/0 but with high quality.

    The GQ for this site are very different depending on whether HaplotyperCaller directly joint genotypes the samples and creates a vcf file or the multistep process of a gvcf/importdb/GenotypeGVCF creates the vcf. The gvcf multistep approach results in a clearly erroneous GQ=0. While the direct genotyping to a vcf has the more accurate GQ=99. These calls were verified with Freebayes.

    Well I am actually very familiar with the gvcf format. With GQ=0 and 0/0 for each sample in the g.vcf, GenotypeGVCF would not output a record. So why bother running it . To verify that, GenotypeGVCF was run any way and indeed no record was produced.

    I now added one heterozygous sample with a GQ=99. The first sample in the output is the heterozygous. So now there is one alt allele with a high quality GQ and the record is now output. As in the g.vcf file all other samples have GQ=0. These should not have GQ=0 but mostly GQ=99 as demonstrated when HaplotypeCaller outputs a vcf file instead.

    So the bug is clearly in the creation of the g.vcf file with incorrect GQ=0 values which is why those results were posted for the developers benefit. They are GQ=99 for these calls when HaplotypeCaller directly generates a VCF. One set of steps produce a GQ=0 and another GQ=99. Both can not be correct.

    Here is the VCF record for that site ....

    AC=1;AF=8.621e-03;AN=116;BaseQRankSum=-8.310e-01;DP=2213;ExcessHet=41.0061;FS=1.957;InbreedingCoeff=-0.3410;MLEAC=10;MLEAF=0.086;MQ=60.00;MQRankSum=0.00;QD=12.17;ReadPosRankSum=0.616;SOR=1.080
    GT:AD:DP:GQ:PL
    0/1:9,8:17:99:227,0,272
    0/0:41,0:41:0:0,0,1097
    0/0:51,0:51:0:0,0,1216
    0/0:61,0:61:0:0,0,1373
    0/0:54,0:54:0:0,0,962
    0/0:49,0:49:0:0,0,1156
    0/0:53,0:53:0:0,0,729
    0/0:44,0:44:0:0,0,1161
    0/0:38,0:38:0:0,0,963
    0/0:68,0:68:0:0,0,1518
    0/0:33,0:33:0:0,0,841
    0/0:54,0:54:0:0,0,687
    0/0:44,0:44:0:0,0,1003
    0/0:33,0:33:0:0,0,709
    0/0:54,0:54:0:0,0,580
    0/0:31,0:31:0:0,0,790
    0/0:36,0:36:0:0,0,843
    0/0:49,0:49:0:0,0,978
    0/0:34,0:34:0:0,0,669
    0/0:39,0:39:0:0,0,898
    0/0:60,0:60:0:0,0,1270
    0/0:48,0:48:0:0,0,908
    0/0:30,0:30:0:0,0,780
    0/0:44,0:44:0:0,0,778
    0/0:24,0:24:0:0,0,531
    0/0:40,0:40:0:0,0,714
    0/0:50,0:50:0:0,0,794
    0/0:44,0:44:0:0,0,953
    0/0:38,0:38:0:0,0,953
    0/0:45,0:45:0:0,0,1081
    0/0:40,0:40:0:0,0,967
    0/0:49,0:49:0:0,0,604
    0/0:38,0:38:0:0,0,675
    0/0:55,0:55:0:0,0,1111
    0/0:65,0:65:0:0,0,1056
    0/0:32,0:32:0:0,0,734
    0/0:21,0:21:0:0,0,551
    0/0:51,0:51:0:0,0,1335
    0/0:56,0:56:0:0,0,1549
    0/0:40,0:40:0:0,0,829
    0/0:35,0:35:0:0,0,848
    0/0:63,0:63:0:0,0,1409
    0/0:10,0:10:0:0,0,229
    0/0:6,0:6:0:0,0,129
    0/0:5,0:5:0:0,0,114
    0/0:38,0:38:0:0,0,893
    0/0:6,0:6:0:0,0,139
    0/0:5,0:5:0:0,0,45
    0/0:9,0:9:0:0,0,75
    0/0:3,0:3:0:0,0,35
    0/0:16,0:16:0:0,0,199
    0/0:22,0:22:0:0,0,526
    0/0:32,0:32:0:0,0,924
    0/0:13,0:13:0:0,0,313
    0/0:55,0:55:0:0,0,1394
    0/0:50,0:50:0:0,0,1294
    0/0:45,0:45:0:0,0,863
    0/0:47,0:47:0:0,0,1092

    Here are the results when the vcf is created directly from HaplotypeCaller bypassing the creation of a g.vcf. Not the value of the GQ. Some of the low coverage sites have 0//1.

    AC=7;AF=0.061;AN=114;BaseQRankSum=-6.147;DP=1846;ExcessHet=3.8592;FS=0.000;InbreedingCoeff=-0.0640;MLEAC=6;MLEAF=0.053;MQ=60.00;MQRankSum=0.000;QD=4.52;ReadPosRankSum=-0.781;SOR=2.833
    GT:AD:DP:GQ:PL
    0/0:37,0:37:99:0,111,1236
    0/0:40,0:40:99:0,120,1357
    0/0:34,0:34:99:0,102,1161
    0/0:49,0:49:99:0,147,1673
    0/0:33,0:33:99:0,99,1036
    0/0:48,0:48:99:0,144,1728
    0/0:42,0:42:99:0,126,1410
    0/0:37,0:37:99:0,111,1215
    0/0:39,0:39:99:0,117,1311
    0/0:42,0:42:99:0,126,1419
    0/0:53,0:53:99:0,159,1744
    0/0:45,0:45:99:0,135,1529
    0/0:44,0:44:99:0,132,1419
    0/0:38,0:38:99:0,114,1299
    0/0:37,0:37:99:0,111,1205
    0/0:34,0:34:99:0,102,1151
    0/0:57,0:57:99:0,171,1826
    0/0:27,1:28:49:0,49,904
    0/0:41,0:41:99:0,123,1364
    0/0:28,0:28:84:0,84,933
    0/0:36,0:36:99:0,108,1171
    0/0:29,0:29:87:0,87,987
    0/0:31,0:31:93:0,93,997
    0/0:37,0:37:99:0,111,1266
    0/0:28,2:30:70:0,70,914
    0/0:36,0:36:99:0,108,1230
    0/0:49,0:49:99:0,147,1613
    0/0:38,1:39:82:0,82,1231
    0/0:26,0:26:78:0,78,869
    0/0:22,0:22:66:0,66,734
    0/0:28,1:29:52:0,52,878
    0/0:36,0:36:99:0,108,1221
    0/0:39,0:39:99:0,117,1315
    0/0:31,1:32:86:0,86,1008
    0/0:38,0:38:99:0,114,1286
    0/0:37,0:37:99:0,111,1266
    0/0:35,0:35:99:0,105,1131
    0/0:32,0:32:96:0,96,1071
    0/0:46,0:46:99:0,138,1508
    0/0:27,0:27:81:0,81,918
    0/0:19,0:19:57:0,57,620
    0/0:45,0:45:99:0,135,1474
    0/0:50,0:50:99:0,150,1687
    0/0:29,0:29:87:0,87,968
    0/0:30,0:30:90:0,90,992
    0/0:55,1:56:99:0,158,1830
    0/1:8,2:10:19:19,0,248
    0/1:5,1:6:17:17,0,145
    0/1:4,1:5:10:10,0,124
    0/1:5,1:6:17:17,0,155
    0/1:3,2:5:34:34,0,79
    0/1:5,3:8:45:45,0,150
    0/1:2,1:3:26:26,0,61
    0/0:13,0:13:39:0,39,431
    0/0:17,0:17:51:0,51,571
    0/0:28,0:28:84:0,84,993
    0/0:10,0:10:30:0,30,328

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    @jfarrell,

    The data you submitted in your bug report covers a region with no calls after GenotypeGVCFs. So if you still think there is a bug and want us to look into it you'll need to submit data that actually illustrates a genotyped call that produces a GQ0 site. Then we can get into the differences between the GVCF workflow vs. joint-calling in default mode in HaplotypeCaller.

    Again, I reiterate that changes in v4.0.9.0 increase GQ scores for HOM_REF sites. I have been in the process of updating workshop tutorial materials and in fact two days ago added this section:

    So I think you'll find the situation with GQ0 HOM_REF sites improved with the latest GATK (the worksheet uses v4.0.11.0).

  • jfarrelljfarrell Member ✭✭

    This variant is not near an indel so the that bug fix is not applicable. It is also not ambiguous in other programs (freebayes and Haplotype Caller directly outputing a vcf). It also is occuring in both the 3.7 and latest 4.0.11.0 version.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin
    edited December 3

    Hi @jfarrell

    As @shlee mentioned:

    you'll need to submit data that actually illustrates a genotyped call that produces a GQ0 site. Then we can get into the differences between the GVCF workflow vs. joint-calling in default mode in HaplotypeCaller.

    Regards
    Bhanu

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    HI @jfarrell

    We have not heard from you in 2 business days and hence will be closing this issue now.

    Regards
    Bhanu

Sign In or Register to comment.