Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

Evaluating the evidence for haplotypes and variant alleles (HaplotypeCaller & Mutect2)

This document details the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Evaluating the evidence for each candidate haplotype
  3. Evaluating the evidence for each candidate site and corresponding alleles

1. Overview

The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.


2. Evaluating the evidence for each candidate haplotype

We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.

Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). Note: If reads from a pair overlap at a site and they have the same base, the base quality is capped at Q20 for both reads (Q20 is half the expected PCR error rate). If they do not agree, we set both base qualities to Q0.

This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.


3. Evaluating the evidence for each candidate site and corresponding alleles

Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.

This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods.

At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.

Post edited by Sheila on

Comments

  • Hi, recently I read the code of haplotypeCaller, and I find that there are two hmm-model in GATK, one in BaseRecalibrator called profile hmm, and one in haplotypeCaller called pair Hmm. I want to know the difference between these two hmms. So, I read the paper wrote by li heng and the book wrote by R.Durbin.

    profile hmm
    pair hmm

    Though the topology between these two hmms is quite different with a glance. But in fact they are almost the same, take a close look at the transmission matrix, they just differ in the entry value of transmission matrix, and when you ignore the start and end status in profile hmm, you will see that both hmms have the same hypothesis, that is to say, insertion state and deletion state can not transfer from each other.

    I wonder if my opinion is right. Could GATK team give me some advice? Thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jesson
    Hi,

    Yes, I think you are correct.

    -Sheila

  • jessonjesson Member
    edited September 12

    @Sheila
    Hi,

    Thanks for your reply! And that's where the problem comes, why the two hmms(same in Essentially) named with different names? Is it because the history reason or they are proposed in different author and published in different time?

    Another question is that I don't understand the code of pairHmm for evaluating the readlikelihood vs haplotype in HaplotypeCaller.

    I understand the prior of matchMatrix which is in fact the emission probability, but where is the emission probability of the equation insertionMatrix[i][j](also for the equation of deletionMatrix[i][i]), is the probability of the four characters it emits is 1("_A","_C","_G","_T"), I don't understand, or there is something that I missed? In heng's paper, the emission probability of insertion state is set to 0.25, I think it is the reason that the four characters("_A","_C","_G","_T") has the same emission probability. And the emission probability of deletion state is set to 1 is because it can only emits ("A_","C_","G_","T_") which could be seemed to one character *_ and can be explained as "deletion status will always emit symbol *_"? So, how do you explain the emission probability of your pairHmm insertion status and deletion status setting?

    The last question is how you set the initial probability of the hidden state?

    I think the initial hidden status in your pairHMM code is deletion, because all the non-zero value comes from this equation deletionMatrix[0][j] = initialValue; in the figure. But if it is the initial hidden state, why you didn't make it sum to 1?

    I will feel sorry for If I asked the stupid question. I like the document and the wonderful code of GATK, because I learned a lot from it, I really hope you will give me a clear derivation for my questions. Thank you very very much!

    -ls

    Post edited by jesson on
  • blueskypyblueskypy Member ✭✭

    In the last paragraph

    sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant

    what is the "sufficient evidence"?

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @blueskypy,

    Sheila has left the team for greener pastures. Can you please provide more context for your question, e.g. a link to the source of this passage, and clarify what it is you'd like help understanding? Thank you.

  • blueskypyblueskypy Member ✭✭

    Thanks @shlee for letting me know! I bet your whole team is appreciated by hundreds of thousands of users in the whole world!

    Regarding the question, I literally mean the last paragraph of this article.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hey @blueskypy,
    I'm just helping out our new forum support person today and I'm noticing that you've posted four similar questions across different threads for the last few days. Would it be possible for you to consolidate all of your questions into a single new post?

    Here are your four threads:

    Thanks.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    @blueskypy,

    In the last paragraph

    sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant

    what is the "sufficient evidence"?

    My understanding is that after PairHMM scoring of alignments (probabilities of haplotype-read pairs), we obtain a matrix of likelihoods of the haplotypes given the reads. Next, we transform support for haplotypes into support for alleles. The latter is necessary to determine genotype. In this transformation, we take the highest per-read likelihood of haplotypes with allele. Only then can we move on to genotyping, where we must determine the most likely genotype at each event.

    Here is a workshop slide to illustrate:

    Again, my understanding is that when transforming haplotypes to alleles, read-likelihoods with the highest scores carry forward. For each read, the highest scoring alleles carry forward.

    If you feel this could use more clarification, please let us know.

  • Chris_MChris_M Member

    Hi,

    I would really appreciate if you could make a worked example of this calculation?
    I have searched for several hours for a worked example and cannot find any.

    Having a site like this (called using HaplotypeCaller) : GT:AD:DP:GQ:PL 0/1:4,28:32:8:806,0,8

    From a frequentist point of view the probability of getting 4 reads supporting reference and 28 supporting alternative allele assuming the GT is 0/1 is extremely low. So I would really appreciate if you can give a detailed description including the constants you are using?

    Hope you can help.

    -Chris_M

  • Chris_MChris_M Member
    edited 12:02PM

    @shlee, In continuation of the above question:

    For low coverage is seem initiative to use Bayesian statics - however for high coverage (~30x) there seems to be a issue where the priors favors the reference GT above the alternative - the variant calling must be reference biased to give the resulting GT, PLs mentioned in above post. Above 20x a frequentist approach would make more sense.

    If, inf fact, the reference is favored using the default priors then how many true heterozygote sites are wrongly genotyped as 0/0 - this could potentially be a big problem. There could be a lot of variation lost using inappropriate priors.

    What is the default priors and how can you change them so all have equal prior? Or maybe use a frequenist approch for sites that have high coverage?

    Please comment on this.

    Thanks,
    -Chris_M

Sign In or Register to comment.