HaplotypeCaller - single block subsitution or multiple SNVs?
Hello GATK team,
I notice that HaplotypeCaller will sometimes call a single block substitution variant where UnifiedGenotyper would call two separate SNVs, e.g.:
chr12 93147901 . CGTGCCA TGTGCCT 6490.01 . AC=4;AF=0.667;AN=6;ActiveRegionSize=260;ClippingRankSum=0.815;DP=153;EVENTLENGTH=0;FS=4.419;MLEAC=4;MLEAF=0.667;MQ=70.00;MQRankSum=1.712;NVH=2;NumHapAssembly=12;NumHapEval=7;QD=57.95;QDE=28.98;ReadPosRankSum=-1.668;TYPE=SNP;extType=MNP GT:GQ:PL 1/1:99:3489,175,0 0/0:99:0,123,2700 1/1:86:3001,86,0
chr12 93147901 . C T 3801.01 . AC=4;AF=0.667;AN=6;BaseQRankSum=-7.623;DP=164;Dels=0.00;FS=2.258;HaplotypeScore=0.8824;MLEAC=4;MLEAF=0.667;MQ=70.00;MQ0=0;MQRankSum=0.370;QD=32.77;ReadPosRankSum=-0.525;SB=-1.332e+03 GT:AD:DP:GQ:PL 1/1:0,63:63:99:2089,172,0 0/0:48,0:48:99:0,120,1581 1/1:1,52:53:99:1712,106,0
chr12 93147907 . A T 3678.01 . AC=4;AF=0.667;AN=6;BaseQRankSum=-2.338;DP=162;Dels=0.00;FS=0.000;HaplotypeScore=1.2101;MLEAC=4;MLEAF=0.667;MQ=70.00;MQ0=0;MQRankSum=-0.710;QD=31.98;ReadPosRankSum=-1.209;SB=-1.345e+03 GT:AD:DP:GQ:PL 1/1:0,61:61:99:1932,156,0 0/0:47,0:47:99:0,120,1525 1/1:1,53:54:99:1746,110,0
I can see other cases where two SNVs the same distance apart (6 bp) are output as two separate variants by HC, e.g..
chr1 13110 . G A 62.24 . AC=1;AF=0.167;AN=6;BaseQRankSum=-2.631;DP=77;Dels=0.00;FS=18.534;HaplotypeScore=3.7523;MLEAC=1;MLEAF=0.167;MQ=33.69;MQ0=0;MQRankSum=-2.277;QD=3.28;ReadPosRankSum=-0.718;SB=-5.869e-03 GT:AD:DP:GQ:PL 0/0:20,0:20:51:0,51,582 0/1:13,6:19:97:97,0,337 0/0:37,1:38:81:0,81,1122
chr1 13116 . T G 344.26 . AC=3;AF=0.500;AN=6;BaseQRankSum=4.784;DP=59;Dels=0.00;FS=9.743;HaplotypeScore=2.8686;MLEAC=3;MLEAF=0.500;MQ=33.73;MQ0=0;MQRankSum=-5.601;QD=5.83;ReadPosRankSum=-0.393;SB=-4.601e+01 GT:AD:DP:GQ:PL 0/1:11,5:16:71:71,0,272 0/1:5,5:11:99:104,0,130 0/1:20,12:32:99:208,0,537
I have two questions:
1) What determines whether two SNVs lying close together are output as a single block substitution or as two separate variants? Does it make a difference that all three samples in this file are called as homozygous for the chr12 variant? I've examined the chr1 variant in IGV and there are reads that overlap both variants for the sample with heterozygous genotypes for both.
2) Is there any option to have HC produce multiple SNV variant lines rather than a block substitution in cases like the chr12 example? (I can't see one, but thought I'd ask just in case). I ask because some downstream programs struggle with block substitutions, and also because I'd like to know how many variants are called by HC compared to UG. I would consider the chr12 example two separate variants as the two SNVs affect different codons, although I appreciate that opinions may differ as to what exactly constitutes a distinct variant in some cases.
Thanks very much in advance.