Haplotype Scoring Algorithm
Hi there,
I'm trying to understand the haplotype scoring algorithm in GATK 1.6.5. Fortunately I have a printed page with a simple diagram that explains the algorithm, but I can't find it anymore on the new website.
The problem is that the formula for calculating the haplotype score in this diagram has a variable I can't identify. This is the formula as it's written:
P(read | haplotype_j) = sum_bi (bi == hi ? ei : 1 - ei / 3) - sum_bi (ei)
I guess bi stands for the base at position i in the current read and hi stands for the base at position i in haplotype_j; that makes sense to me. But what is ei? Maybe I'm missing something... it looks like it should be a probability in the range (0, 1) for the haplotype score to make sense.
Thanks in advance!
Pablo.
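For anyone following along, here is a minimal Python sketch of the expression quoted above. It is only one reading of the printed formula, not GATK's code: it assumes the two operators that are hard to make out are minus signs (1 - ei / 3, and the per-base sum minus the sum of the ei), and it treats ei as a per-base value in (0, 1) supplied by the caller. The function name read_haplotype_score is hypothetical.

```python
def read_haplotype_score(read_bases, haplotype_bases, e):
    """Evaluate the quoted expression for one read against one haplotype.

    Assumptions (not confirmed by the source):
      - e[i] is a per-base value in (0, 1)
      - the formula is sum_i(b_i == h_i ? e_i : 1 - e_i / 3) - sum_i(e_i)
    """
    if not (len(read_bases) == len(haplotype_bases) == len(e)):
        raise ValueError("read, haplotype and e must have the same length")

    # Matching bases contribute e_i; mismatching bases contribute 1 - e_i / 3.
    per_base = sum(
        ei if b == h else 1.0 - ei / 3.0
        for b, h, ei in zip(read_bases, haplotype_bases, e)
    )
    # Subtract the sum of the e_i over the same positions.
    expected = sum(e)
    return per_base - expected


if __name__ == "__main__":
    e = [0.01, 0.01, 0.01, 0.01]
    print(read_haplotype_score("ACGT", "ACGT", e))  # read identical to haplotype
    print(read_haplotype_score("ACGA", "ACGT", e))  # one mismatch at the last base
```

Under this reading a read that matches the haplotype everywhere scores 0, and each mismatch adds close to 1 when ei is small, so the value behaves like a mismatch count corrected for the ei.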
Best Answer

Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie
Well, that article links to the calling slides, which is what you needed, hmm?
The "e" is essentially the recalibrated quality score for each position, which is the output of the BaseRecalibrator, taking into account several different covariates.
Answers
See this article:
http://www.broadinstitute.org/gatk/guide/article?id=1237
Hi,
Thanks for your answer.
They are not really talking about the haplotype scoring algorithm in that article. Anyway, that led me to the fragment-based SNP calling slides, and they refer to an "e" which is the sequencing error rate. Could this be it?
We had ei, so it would be specific to that position rather than a single static sequencing error rate. I guess it is the error rate for position i in haplotype j, which might be the number of mismatches to consensus haplotype j at position i divided by the total counts for position i in haplotype j.
But I'm still unsure how you calculate the sequencing error rate. Going through the documentation for ErrorRatePerCycle, the error rates calculated there do not correspond simply to mismatches/counts.
Thanks!
Pablo.
Well, that article links to the calling slides, which is what you needed, hmm?
The "e" is essentially the recalibrated quality score for each position, which is the output of the BaseRecalibrator, taking into account several different covariates.
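To make the link between a quality score and a value in (0, 1) concrete: assuming the base qualities are Phred-scaled, as is standard in FASTQ and SAM/BAM, a quality Q corresponds to an error probability of 10^(-Q/10). The sketch below shows that conversion; the helper names are hypothetical and are not GATK functions, and the offset of 33 is the usual Sanger/SAM text encoding.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string (Phred+33 by default) into integer scores."""
    return [ord(c) - offset for c in qual]


if __name__ == "__main__":
    quals = decode_quality_string("I5")  # 'I' encodes Q40, '5' encodes Q20
    for q in quals:
        print(q, phred_to_error_prob(q))
```

So a recalibrated Q20 base would give ei of about 0.01 and a Q30 base about 0.001, one value per position in the read.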
Aaaaaaahhhh, OK. I knew it had to be something evident... that makes sense.
I just wanted to fully understand the haplotype scoring and I missed what this e was; I was thinking about calculating some error probability