Unexpected genotypes in GenotypeGVCFs output
Hello GATK Team,
I have 21 BAM files that I ran through HaplotypeCaller in GVCF mode followed by GenotypeGVCFs, using Version=3.4-0-g7e26428. I found a few entries that I am having a difficult time understanding.
Here is one of the entries in question (quick note: each sample is from a pool of 3-5 animals, explaining the high ploidy):
chr1 88988835 rs396411987 C G 226099.87 . AC=61;AF=0.455;AN=134;BaseQRankSum=-1.176e+00;ClippingRankSum=0.054;DB;DP=29307;FS=0.000;MLEAC=61;MLEAF=0.455;MQ=59.98;MQRankSum=0.603;QD=11.32;ReadPosRankSum=1.30;SOR=0.696 GT:AD:DP:GQ:PL 0/0/0/0/0/1/1/1/1/1:396,449,0,0:845:28:9249,2087,965,413,120,0,28,219,649,1588,7879 0/1/1/1/1/1/1/1/1/1:275,2150,0,0:2425:99:52058,17674,11483,7903,5424,3571,2143,1051,287,0,3240 0/0/0/1/1/1/1/1:626,1134,0,0:1760:99:24237,5532,2600,1118,316,0,197,1292,11237 0/0/1/1/1/1/1/1:225,824,0,0:1049:99:18893,5121,2835,1577,772,257,0,116,3439 ./././././././././.:0,0 ./././././././././.:0,0 ./././././././.:0,0 0/0/0/0/0/0/0/1:824,105,0,0:929:99:1309,0,239,705,1369,2290,3644,6013,20091 0/0/0/0/0/1/1/1:786,499,0,0:1285:99:9169,1197,249,0,139,633,1609,3600,16598 0/0/1/1/1/1:557,1132,0,2:1691:99:24696,4545,1719,433,0,563,9784 0/0/0/0/0/0/0/1:922,140,0,0:1062:99:1825,0,202,685,1401,2410,3908,6544,22289 0/0/0/0/1/1/1/1/1/1:650,828,0,0:1478:26:16909,4072,1966,903,311,26,0,254,906,2400,12402 ./././././././.:0,0 0/0/0/0/0/0/0/0/1/1:846,179,0,0:1025:95:2546,95,0,178,520,1015,1689,2617,3986,6389,20368 0/0/0/0/1/1/1/1:875,986,0,0:1861:99:20237,3745,1410,380,0,136,885,2819,17626 0/0/0/0/0/0/0/0:570,0:570:23:0,23,50,82,120,170,241,361,1800 0/0/0/0/0/1/1/1:798,604,0,0:1402:20:11395,1675,423,0,20,429,1344,3302,16391 0/0/0/0/1/1/1/1:644,730,0,0:1374:96:14884,2778,1049,285,0,96,643,2062,12779 0/0/0/0/0/1/1/1:613,417,0,0:1030:74:7703,1065,243,0,74,433,1174,2710,12915 0/0/0/1/1/1/1/1:239,513,0,0:752:13:11159,2667,1309,604,198,0,13,378,4191 ./././././././.:0,0
So, for the samples with genotypes present, AD is obviously quite high (in the 1000s, generally). It surprised me, then, that some samples don't have information (represented by ./././././././.). I went back to the original sample gvcf files and pulled the entries from a subset that were represented by ./././././././.:
group14.g.vcf:chr1 88988835 rs396411987 C G,A,<NON_REF> 16897.68 . BaseQRankSum=-0.709;ClippingRankSum=0.843;DB;DP=1729;MLEAC=5,0,0;MLEAF=0.500,0.00,0.00;MQ=59.96;MQRankSum=0.621;ReadPosRankSum=1.987 GT:AD:DP:GQ:SB 0/0/0/0/0/1/1/1/1/1:876,844,2,0:1722:99:443,433,424,422
group15.g.vcf:chr1 88988835 rs396411987 C G,T,<NON_REF> 690.35 . BaseQRankSum=-1.311;ClippingRankSum=-0.019;DB;DP=1168;MLEAC=1,0,0;MLEAF=0.125,0.00,0.00;MQ=59.98;MQRankSum=0.603;ReadPosRankSum=3.624 GT:AD:DP:GQ:SB 0/0/0/0/0/0/0/1:1082,78,4,0:1164:99:542,540,44,38
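For anyone wanting to reproduce the check, here is a minimal sketch of how the no-call samples in a joint-genotyped record could be listed. The function name `no_call_samples` is just an illustration, not part of GATK, and the record is a shortened copy of the entry above (INFO trimmed to one key, only three of the sample columns kept). It splits on any whitespace because the pasted text uses spaces, whereas a real VCF is tab-delimited.

```python
def no_call_samples(vcf_line):
    """Return 0-based indices of sample columns whose GT is entirely missing."""
    # Pasted records here are space-separated; real VCF lines are tab-delimited.
    fields = vcf_line.rstrip("\n").split()
    format_keys = fields[8].split(":")      # column 9 is FORMAT
    gt_index = format_keys.index("GT")
    missing = []
    for i, sample in enumerate(fields[9:]):  # sample columns start at column 10
        gt = sample.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")
        if all(a == "." for a in alleles):
            missing.append(i)
    return missing


# Shortened for illustration: INFO trimmed, three sample columns copied
# from the chr1:88988835 record above.
record = (
    "chr1 88988835 rs396411987 C G 226099.87 . AC=61 GT:AD:DP:GQ:PL "
    "0/0/0/0/0/0/0/0:570,0:570:23:0,23,50,82,120,170,241,361,1800 "
    "./././././././.:0,0 "
    "0/0/1/1/1/1:557,1132,0,2:1691:99:24696,4545,1719,433,0,563,9784"
)

print(no_call_samples(record))
```

The returned indices can then be matched against the sample names in the `#CHROM` header line to see which per-sample gVCFs to inspect.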
A consistent difference between the samples that are included and those that are not is the presence of multiple alternate alleles in the excluded samples (G,A and G,T in the above example). Is this the source of my troubles? Is there a way of forcing GATK to include those samples in the VCF, especially given that the support for the third allele seems pretty weak (2 reads out of ~1722 in group14 above)? It seems to me that these samples should reasonably be included in the output of GenotypeGVCFs.
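To put a number on "pretty weak", here is a small sketch that turns the AD values from the two gVCF entries above into per-allele read fractions. The helper name `allele_fractions` is made up for this post. It simply divides each AD count by the AD total, which is not necessarily how GATK itself weighs the evidence.

```python
def allele_fractions(ad_field):
    """Convert a comma-separated AD string into per-allele read fractions."""
    depths = [int(x) for x in ad_field.split(",")]
    total = sum(depths)
    if total == 0:
        # No informative reads: report zero support for every allele.
        return [0.0] * len(depths)
    return [d / total for d in depths]


# AD values copied from the group14 and group15 gVCF entries above.
# Allele order follows the record: REF, first ALT, second ALT, last ALT.
for name, ad in [("group14", "876,844,2,0"), ("group15", "1082,78,4,0")]:
    fractions = allele_fractions(ad)
    print(name, ["{:.4f}".format(f) for f in fractions])
```

In both samples the second alternate allele accounts for well under 1% of the reads counted in AD.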
Apologies if the answer is somewhere in the forums/guide - I did check, but didn't find anything.