Rationale behind MuTect2 and HaplotypeCaller

Dear GATK team,

I would like to better understand MuTect2 and HaplotypeCaller so that I can present the methods in my thesis on somatic mutation discovery. Unfortunately, the only reference I have is the original MuTect paper.

First question: is there a reference that explains in detail how HaplotypeCaller works? (It might answer my other questions.)

I nevertheless have some questions about how the original MuTect is intertwined with HaplotypeCaller.

As far as I understand, HaplotypeCaller uses four steps (and MuTect2 too, I assume?):
1. Define active regions
2. Determine haplotypes by assembly of the active region
3. Determine likelihoods of the haplotypes given the read data
4. Assign sample genotypes

Also, as far as I understand, the original MuTect TLOD is used to define the active regions (I don't know whether the NLOD is involved).

My questions are:

  • Are the likelihoods calculated with the PairHMM in step 3 linked to the TLOD and NLOD? Are they used to select variants?
  • In the end, which parameters determine whether a variant is given a "PASS"? I know there are TLOD and NLOD thresholds (TLOD > 6.3 in the original MuTect), but how do steps 2, 3 and 4 affect whether a variant is labelled "PASS"?
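
To make the TLOD threshold in the second question concrete, here is a minimal sketch of a tumor log-odds score as I read it from the original MuTect paper: the log10 ratio between the likelihood of the reads under a variant model with allele fraction f and under a reference-only model (f = 0). The function names, the pileup representation, and the use of the observed alt fraction as the estimate of f are assumptions made for illustration. This is not MuTect2 code, and MuTect2 may compute its TLOD differently.

```python
import math

def base_likelihood(base, ref, alt, err, f):
    """P(observed base | ref, alt, base error probability, allele fraction f)."""
    if base == ref:
        return f * err / 3 + (1 - f) * (1 - err)
    if base == alt:
        return f * (1 - err) + (1 - f) * err / 3
    return err / 3  # any third base can only be a sequencing error

def tumor_lod(bases, quals, ref, alt):
    """log10 likelihood ratio: variant model at estimated f vs. f = 0.

    Assumption: f is estimated as the observed fraction of alt bases.
    """
    errs = [10 ** (-q / 10) for q in quals]  # Phred quality -> error probability
    f_hat = sum(b == alt for b in bases) / len(bases)

    def log10_lik(f):
        return sum(math.log10(base_likelihood(b, ref, alt, e, f))
                   for b, e in zip(bases, errs))

    return log10_lik(f_hat) - log10_lik(0.0)

# Hypothetical pileup: 20 reference reads and 5 alt reads, all at Q30.
pileup = "A" * 20 + "T" * 5
score = tumor_lod(pileup, [30] * len(pileup), ref="A", alt="T")
print(score > 6.3)  # compare against the original MuTect threshold
```

Under this reading, the score only looks at bases piled up at one site, which is why the link to the haplotype likelihoods of step 3 is the open question.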

Is there a paper on this method that will be released soon?

Thank you very much in advance! Have a nice day. Kind regards,

Alexandre Coudray

Answers

  • ac67479, Austin, Member

    Hey, so I ended up on a page that explains the principle of HaplotypeCaller pretty well (http://gatkforums.broadinstitute.org/gatk/discussion/4148/hc-overview-how-the-haplotypecaller-works). I am still wondering how and where the TLOD and NLOD are involved. In step 4, we choose the genotype with the highest likelihood, and if it is not homozygous reference, we declare the site a variant. What about the final P(G|D)? Can we retrieve this value somewhere? Are the NLOD and TLOD only used in step 1 to select active regions?
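
As a rough illustration of the step 4 logic described above, the sketch below applies Bayes' rule, P(G|D) = P(D|G) P(G) / sum over G' of P(D|G') P(G'), to a set of genotype likelihoods and picks the most probable genotype. It assumes the likelihoods are given as Phred-scaled, normalized values like the PL field of a VCF record, and it assumes a flat prior over genotypes. The PL values and function names are made up for illustration; this is not HaplotypeCaller's actual implementation.

```python
def genotype_posteriors(phred_likelihoods, priors=None):
    """Turn Phred-scaled genotype likelihoods into posteriors P(G|D)."""
    genotypes = list(phred_likelihoods)
    if priors is None:
        # Assumption: flat prior over the candidate genotypes.
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    # Phred scale -> relative likelihood P(D|G), up to a constant factor.
    joint = {g: 10 ** (-phred_likelihoods[g] / 10) * priors[g] for g in genotypes}
    total = sum(joint.values())
    return {g: joint[g] / total for g in genotypes}

# Hypothetical PL values for a diploid site: hom-ref, het, hom-alt.
pl = {"0/0": 120, "0/1": 0, "1/1": 200}
posteriors = genotype_posteriors(pl)
best = max(posteriors, key=posteriors.get)
is_variant = best != "0/0"  # a non-reference best genotype marks a variant site
print(best, is_variant)
```

If this picture is right, the per-genotype values should at least be recoverable in Phred-scaled form from the PL and GQ fields of the output VCF, though whether the exact posterior is reported anywhere is part of what I am asking.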
