Variant Recalibrator HaplotypeScore filtering

ddoopusddoopus californiaMember

First, it would be great to understand exactly what HaplotypeScore is and
how it is computed using the Unified Genotyper. I would think it has
something to do with the neighboring candidate SNPs, but these forum posts
imply otherwise and yet don't give a definitive answer:

http://gatkforums.broadinstitute.org/discussion/2521/haplotype-score-region
http://gatkforums.broadinstitute.org/discussion/3168/haplotypescore-and-nearby-variants
http://gatkforums.broadinstitute.org/discussion/2369/calculation-of-haplotypescore

One of them discusses vaguely how this is computed from multiple samples.
Does this score have any meaning if running on a single whole genome
sample?

We are interested in SNPs within segmental duplications and want to retain
calls even if there is a copy number amplification of a region. That is,
say the reference has copies A and B and our sample has A, B, and C with C
more similar to B. Our alignments will have SNPs from C overlaid ontop of B. If the
"haplotype score" is a metric to explain the observed neighboring variants
with at most two haplotypes, then I would think reads from C aligning to
B would cause the "haplotype score" for all SNPs in B to increase.
Is this the case? If so, is there a way to turn this off so that we can still
take advantage of the other features from variant recalibration or is
there a better value we can use than " > 13.0" for a hard filtering
threshold? I am assuming we will have to omit it from the actual "model
training" since most of the genome will have low haplotype scores and
consequently these points of interest will always be outliers.

We have tried a couple of things and still cannot make much sense of the
recalibrated VCF and consistently see what appears to be quality SNPs being
filtered. Using variant recalibrator with -an MQRankSum -an ReadPosRankSum
-an FS -an MQ -an HaplotypeScore, we see this SNP filtered:

chr22 21595839 . G A 1906.77 VQSRTrancheSNP99.00to99.90 ABHet=0.620;AC=1;AF=0.500;AN=2;BaseQRankSum=10.400;DP=205;Dels=0.00;FS=3.007;HaplotypeScore=12.1634;MLEAC=1;MLEAF=0.500;MQ=54.83;MQ0=0;MQRankSum=-0.093;NEGATIVE_TRAIN_SITE;QD=9.30;ReadPosRankSum=0.205;VQSLOD=-2.743e+00;culprit=MQ GT:AD:DP:GQ:PL 0/1:12 7,78:205:99:1935,0,3160

First, I don't understand how the culprit can be MQ since almost all the
reads have MAPQ=55 in this region and I see other passed variants with
lower MQ scores. How can this happen?

Second, I was unsure why this site was selected as NEGATIVE_TRAIN_SITE. I
had read negative examples were chosen as the outliers of an "initial
model" and the only thing I see as being a little outstanding is the
HaplotypeScore, so I removed -an HaplotypeScore and reran recalibration.
After, this site now has the entry:

chr22 21595839 . G A 1906.77 VQSRTrancheSNP99.00to99.90 ABHet=0.620;AC=1;AF=0.500;AN=2;BaseQRankSum=10.400;DP=205;Dels=0.00;FS=3.007;HaplotypeScore=12.1634;MLEAC=1;MLEAF=0.500;MQ=54.83;MQ0=0;MQRankSum=-0.093;QD=9.30;ReadPosRankSum=0.205;VQSLOD=-7.766e-01;culprit=MQ GT:AD:DP:GQ:PL 0/1:127,78:205:99:1935,0,3 160

Everything looks the same except this site now has a slightly higher
VQSLOD score and is no longer a NEGATIVE_TRAIN_SITE which makes me think
my initial guess was correct. However, I still don't understand at all
how the culprit could be MQ.

Any input for how to proceed given we are interested in SNPs within
segmental duplications for the case described above would be greatly
appreciated.

Thanks!
alex

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ddoopus‌

    Hi Alex,

    We know some of the annotations are not explained in proper detail, so we are working on a new set of explanations. Those should be posted soon.

    Haplotype Score is the consistency of the site with two (and only two) segregating haplotypes. It is important that any given sample has no more than 2 segregating haplotypes. If there are more than 2 segregating haplotypes, the reads are not as trustworthy because they have artefactual evidence in favor of more than two alleles. If there are more than two haplotypes at a site, then we cannot be as confident in the call because there is evidence that supports more than 2 alleles at the site and this makes it confusing. Higher scores are indicative of regions with bad alignments, often leading to artifactual SNP and indel calls. So, yes, if your "C" reads align to "B" then the haplotype score will increase.

    This score does have meaning when running on a single whole genome sample. The Haplotype Score is calculated separately for each sample, but if there are more than 1 sample, the scores are combined for a final overall score.

    As for turning off Haplotype Score, you do not have to. In VQSR, there is not a single annotation that makes a variant be filtered out or not. Filtering is done based on the VQSLOD. The "culprit" annotation in the filter field gives you the annotation that has the least good value of all the site's annotations, but it doesn't mean that value is necessarily a bad value as such. The point of VQSR is to move away from deciding that any given annotation has to pass a specific threshold. Instead, it looks at the entire set of annotations and evaluates what combinations look good or bad based on the profile of the training sites. You can generally ignore the culprit field when using VQSR; it is mostly a holdover from manual filtering.

    -Sheila

  • ddoopusddoopus californiaMember

    Hi Sheila,

    Thank you very much for the response.

    Okay, thank you for confirming what would happen in my example, but is it
    possible to further clarify how the scoring is done? Haplotypes imply
    information about more than one variant, so how is this obtained with
    Unified Genotyper? Does this score only consider other variants which
    are near enough to be connected by reads which overlap the site in
    question? Does it consider the pairs as well? Are read alignment base
    substituions used instead of other variant candidates? Sometimes I see a
    high haplotype score, but I see no other candidate sites (filtered or not)
    within a read length.

    With regards to the single sample, what do you mean by "combined for a
    final score". If there is only one sample, then shouldn't there only be
    one "haplotype score" for each site?

    I am still very confused about why the culprit would be MQ in the example
    I gave you. In the two *vcf lines I copied you the MAPQ is almost perfect
    for all the reads. How can this be the "least good value" when it is
    clear that HaplotypeScore is the only thing which looks remotely like an
    outlier? Although I am fine with ignoring this as you suggest, the end
    user should at least be able to make sense of it.

    Since we are interested only in variants in regions of segmental
    duplications (where the sample is likely to have additional copies as
    well), it seems the haplotype score for these areas will always be high
    and as shown in my example these points will be selected as outliers by
    the "inital model". Doesn't it seem necessary to remove this given this
    is the case? I understand this may lead to a higher false positive rate,
    but for our use case this is acceptable if we retain more of the true
    variants.

    If the resulting "culprit" made sense and was haplotype score
    as opposed to MQ then I would have a better idea of how to proceed, but
    without this it is almost just random guess and try at this point.

    Thanks again!
    alex

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ddoopus‌

    Hi Alex,

    Haplotype Score in this context does not necessarily contain information about more than 1 variant. Our version of haplotype score simply looks 10 bases to the left and 10 bases to the right of the site of interest. So, it constructs haplotypes of 21 bp. The haplotypes may or may not contain other variants. If there are other variants, then the haplotype score tells whether the site of interest segregates consistently with them. If the haplotypes do not contain other variants, the score is simply dependent on whether the site shows only 2 alleles. A higher score when there are no other variants surrounding the site means there are more than 2 alleles present in the reads at the site.

    For 1 sample, there is only one score calculated for each site. When I said the scores are combined, I meant they are combined when there is more than 1 sample.

    As for the culprit, we will discuss with the developers how to interpret it or potentially remove it because it is meaningless.

    You are right about not using haplotype score; this is indeed a case where you do not want to use it. Your reasoning is correct, and also, we do not recommend Haplotype Score as part of our best practices recommendations.

    We understand that filtering variants is a delicate exercise, and while we do provide suggestions that will work for most cases, it is ultimately up to the analyst to select filtering criteria. Some helpful tips for selecting criteria include plotting the distribution of annotation values in the data and paying careful attention to the recalibration plots.

    I hope this helps.

    -Sheila

  • ddoopusddoopus californiaMember

    Hi Sheila,

    Thank you very much for your input. I now have a better idea for what to try next.

    Thanks!
    alex

Sign In or Register to comment.