Variant Recalibrator HaplotypeScore filtering
First, it would be great to understand exactly what HaplotypeScore is and
how it is computed using the Unified Genotyper. I would think it has
something to do with the neighboring candidate SNPs, but these forum posts
imply otherwise and yet don't give a definitive answer:
One of them discusses vaguely how this is computed from multiple samples.
Does this score have any meaning if running on a single whole genome
We are interested in SNPs within segmental duplications and want to retain
calls even if there is a copy number amplification of a region. That is,
say the reference has copies A and B and our sample has A, B, and C with C
more similar to B. Our alignments will have SNPs from C overlaid ontop of B. If the
"haplotype score" is a metric to explain the observed neighboring variants
with at most two haplotypes, then I would think reads from C aligning to
B would cause the "haplotype score" for all SNPs in B to increase.
Is this the case? If so, is there a way to turn this off so that we can still
take advantage of the other features from variant recalibration or is
there a better value we can use than " > 13.0" for a hard filtering
threshold? I am assuming we will have to omit it from the actual "model
training" since most of the genome will have low haplotype scores and
consequently these points of interest will always be outliers.
We have tried a couple of things and still cannot make much sense of the
recalibrated VCF and consistently see what appears to be quality SNPs being
filtered. Using variant recalibrator with -an MQRankSum -an ReadPosRankSum
-an FS -an MQ -an HaplotypeScore, we see this SNP filtered:
chr22 21595839 . G A 1906.77 VQSRTrancheSNP99.00to99.90 ABHet=0.620;AC=1;AF=0.500;AN=2;BaseQRankSum=10.400;DP=205;Dels=0.00;FS=3.007;HaplotypeScore=12.1634;MLEAC=1;MLEAF=0.500;MQ=54.83;MQ0=0;MQRankSum=-0.093;NEGATIVE_TRAIN_SITE;QD=9.30;ReadPosRankSum=0.205;VQSLOD=-2.743e+00;culprit=MQ GT:AD:DP:GQ:PL 0/1:12 7,78:205:99:1935,0,3160
First, I don't understand how the culprit can be MQ since almost all the
reads have MAPQ=55 in this region and I see other passed variants with
lower MQ scores. How can this happen?
Second, I was unsure why this site was selected as NEGATIVE_TRAIN_SITE. I
had read negative examples were chosen as the outliers of an "initial
model" and the only thing I see as being a little outstanding is the
HaplotypeScore, so I removed -an HaplotypeScore and reran recalibration.
After, this site now has the entry:
chr22 21595839 . G A 1906.77 VQSRTrancheSNP99.00to99.90 ABHet=0.620;AC=1;AF=0.500;AN=2;BaseQRankSum=10.400;DP=205;Dels=0.00;FS=3.007;HaplotypeScore=12.1634;MLEAC=1;MLEAF=0.500;MQ=54.83;MQ0=0;MQRankSum=-0.093;QD=9.30;ReadPosRankSum=0.205;VQSLOD=-7.766e-01;culprit=MQ GT:AD:DP:GQ:PL 0/1:127,78:205:99:1935,0,3 160
Everything looks the same except this site now has a slightly higher
VQSLOD score and is no longer a NEGATIVE_TRAIN_SITE which makes me think
my initial guess was correct. However, I still don't understand at all
how the culprit could be MQ.
Any input for how to proceed given we are interested in SNPs within
segmental duplications for the case described above would be greatly