Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
GenotypeGVCFs - Unexpected homozygous/heterozygous genotypes near indels
I found that GenotypeGVCFs in GVCF mode can lead to an unexpected homozygous or heterozygous genotypes when one SNP is called within an indel.
In the following example I have the individual1 showing only an indel at position 183095185. The second individual shows also an indel at position 1830095185 but also one SNP at the position 183095187 (even if this individual seems to be homozygous because only 2 reads support the laternate allele).
Executing GenotypeGVCFs on those 2 individuals lead to an erroneous calling with weird allele depths for the individual 1 for the SNP at the position 183095187 : 1/1 with an AD of 1,0 (the second individual remains homozygous for the reference allele at the SNP).
Individual 1 GVCF :
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind1 1 183095179 . C <NON_REF> . . END=183095179 GT:DP:GQ:MIN_DP:PL 0/0:49:39:49:0,39,585 1 183095180 . T <NON_REF> . . END=183095182 GT:DP:GQ:MIN_DP:PL 0/0:53:36:52:0,36,540 1 183095183 . A <NON_REF> . . END=183095184 GT:DP:GQ:MIN_DP:PL 0/0:53:33:53:0,33,495 1 183095185 rs79025658 TAAA T,TA,<NON_REF> 1275.73 . BaseQRankSum=0.803;ClippingRankSum=0.380;DB;DP=67;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;MQ=59.62;MQ0=0;MQRankSum=-0.972;ReadPosRankSum=1.141 GT:AD:DP:GQ:PL:SB 1/2:1,7,33,0:41:36:1313,855,951,155,0,36,1010,781,136,914:1,0,29,4 1 183095189 . A <NON_REF> . . END=183095189 GT:DP:GQ:MIN_DP:PL 0/0:68:0:68:0,0,1962 1 183095190 . A <NON_REF> . . END=183095233 GT:DP:GQ:MIN_DP:PL 0/0:84:99:66:0,120,1800
Individual 2 GVCF :
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind2 1 183095182 . T <NON_REF> . . END=183095182 GT:DP:GQ:MIN_DP:PL 0/0:33:51:33:0,51,765 1 183095183 . ACT A,<NON_REF> 0 . BaseQRankSum=-2.106;ClippingRankSum=-0.311;DP=38;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.28;MQ0=0;MQRankSum=1.830;ReadPosRankSum=0.943 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:34,2,0:36:68:0|1:183095183_ACT_A:0,68,1535,111,1541,1585:29,5,0,0 1 183095185 rs398049833 TAA T,TA,<NON_REF> 283.73 . BaseQRankSum=-0.145;ClippingRankSum=-0.894;DB;DP=42;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.26;MQ0=0;MQRankSum=-0.561;ReadPosRankSum=1.995 GT:AD:DP:GQ:PL:SB 0/1:14,13,3,0:30:99:321,0,529,223,286,549,290,349,521,582:12,2,10,3 1 183095187 . A T,<NON_REF> 0 . BaseQRankSum=-2.069;ClippingRankSum=0.350;DP=42;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.26;MQ0=0;MQRankSum=0.287;ReadPosRankSum=1.079 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:37,2,0:39:68:0|1:183095183_ACT_A:0,68,1535,111,1541,1585:29,8,0,0 1 183095188 . A <NON_REF> . . END=183095188 GT:DP:GQ:MIN_DP:PL 0/0:38:0:38:0,0,453 1 183095189 . A <NON_REF> . . END=183095200 GT:DP:GQ:MIN_DP:PL 0/0:43:99:40:0,99,1485
GenotypeGVCFs output :
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind1 ind2 1 183095185 . TAA T 1604.19 . AC=3;AF=0.750;AN=4;BaseQRankSum=0.803;ClippingRankSum=0.380;DP=109;FS=0.000;GQ_MEAN=220.00;GQ_STDDEV=142.84;MLEAC=2;MLEAF=0.500;MQ=60.26;MQ0=0;MQRankSum=-5.610e-01;NCC=0;QD=26.30;ReadPosRankSum=2.00;SOR=0.765 GT:AD:DP:GQ:PL 1/1:1,33:41:99:1277,119,0 0/1:14,13:30:99:321,0,529 1 183095187 . A T 364.42 . AC=2;AF=0.500;AN=4;BaseQRankSum=-2.069e+00;ClippingRankSum=0.350;DP=109;FS=4.440;GQ_MEAN=82.00;GQ_STDDEV=19.80;MLEAC=2;MLEAF=0.500;MQ=60.26;MQ0=0;MQRankSum=0.287;NCC=0;QD=34.24;ReadPosRankSum=1.08;SOR=1.395 GT:AD:DP:GQ:PGT:PID:PL 1/1:1,0:41:96:.:.:399,96,0 0/0:37,2:39:68:0|1:183095183_ACT_A:0,68,1535
I have found that this error occurs many time and, each time, it seems that a SNP and an indel overlap in 2 different individuals. Sometimes the resulting error calling is homozygous for the alternate allele (1/1), sometimes heterozygous but always with almost no allele depths for alternate allele. When I look at bam resulting file from HaplotypeCaller, there has been some reassembly but there's no read supporting an alternate allele for the first individual at the SNP position (in my example, at position 183095185).
I think this strange behaviour may be the problem described in this post http://gatkforums.broadinstitute.org/discussion/5021/unexpected-heterozygous-genotype-produced-by-genotypegvcfs
Finally, I used GATK-3.3 and the problem still persists with the nightly build of 2015-01-16-g296288a.