Qempirical and recalibrated quality score recal

I read the GATK paper, “A framework for variation discovery and genotyping using next-generation DNA sequencing data”.

In the ONLINE METHODS, Base quality score recalibration section, I understand the calculation of Qempirical(R,C,D). At this point we have already obtained a recalibrated quality score, so why do we need to go further to get recal(r,c,d)? What is the difference between the “R,C,D” set and the “r,c,d” set? Is the former a superset of the latter?
Thanks!

Answers

  • MartinW, US, Member

    I am a little disappointed to find that my question is still not answered.
    Isn't this the right place to ask this question? Isn't this a straightforward question for GATK authors?

  • ebanks, Broad Institute, Member, Broadie, Dev ✭✭✭✭

    Hi MartinW,

    Sorry to have disappointed you, but you do realize that it has been only one day, right? And that you posted on a Sunday which is not generally a work day? And that you posted a question that refers back to a paper from several years ago?

    Geraldine, as always, has done her job and earlier today asked the author of the tool to go back and look up the equation from the old paper. But he is very focused on other cutting edge research these days, so you will need to wait until he has a few free cycles to look at this. Impatience on your end only serves to bump this lower down the priority list.

  • rpoplin, Member ✭✭✭
    edited November 2013

    Hi there,

    It looks like there is confusion between "R,C,D" and "r,c,d" in the equation. The capital letters represent the set of all possible values for that covariate while the little letters are specific values for that covariate.

    So, Qempirical(R,C,D) is the empirical quality for the data marginalized over all possible values for R, C, and D. This is essentially the average quality score over every base in the lane. Not a very useful estimate for the quality score of a base.

    recal(r,c,d) is the recalibrated quality for a base with specific covariate values r, c, and d.
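
To make the notation concrete, here is a minimal sketch of the difference between marginalizing over a covariate (capital letter) and fixing it to one value (lowercase letter). The tallies, the function name `empirical_quality`, and the add-one smoothing are all assumptions made for illustration; they are not taken from the paper or from GATK code.

```python
import math

# Hypothetical per-category tallies for one lane:
# (reported quality r, machine cycle c, dinucleotide d) -> (mismatches, bases)
counts = {
    (30, 1, "AC"): (1, 1000),
    (30, 2, "AC"): (4, 1000),
    (20, 1, "GT"): (20, 1000),
    (20, 2, "GT"): (30, 1000),
}

def phred(p_error):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(p_error)

def empirical_quality(counts, r=None, c=None, d=None):
    """Empirical quality over the selected categories.

    Passing None for a covariate marginalizes over all of its values
    (the capital letter); passing a value restricts to that value
    (the lowercase letter).
    """
    mismatches = bases = 0
    for (ri, ci, di), (m, n) in counts.items():
        if (r is None or ri == r) and (c is None or ci == c) and (d is None or di == d):
            mismatches += m
            bases += n
    # Add-one smoothing is an assumption of this sketch, used to avoid log10(0).
    return phred((mismatches + 1) / (bases + 2))

q_all = empirical_quality(counts)                      # like Qempirical(R,C,D)
q_r30 = empirical_quality(counts, r=30)                # like Qempirical(r,C,D)
q_cell = empirical_quality(counts, r=30, c=1, d="AC")  # one specific (r,c,d) cell
```

With these made-up numbers, `q_all` is a single lane-wide value, while `q_r30` and `q_cell` differ from it because they pool fewer categories. The sketch only illustrates the capital versus lowercase notation; it does not reproduce the paper's recal(r,c,d) formula.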

    I hope that helps.
    Cheers,

  • MartinW, US, Member

    Hi rpoplin,
    Your note makes a lot of sense. I was confused because the paper defines Qempirical(R,C,D) as the empirical quality score for each category, rather than as a marginalized or average score over every base in the lane. Here is the context in which Qempirical(R,C,D) is defined in the paper:
    “For each lane, the algorithm first tabulates empirical mismatches to the reference at all loci not known to vary in the population (dbSNP build 129), categorizing the bases by their reported quality score (R), their machine cycle in the read (C) and their dinucleotide context (D). For each category we estimate the empirical quality score:..”
    Is this statement accurate? If you are sure Qempirical(R,C,D) is a marginalized score or average score, then my question is answered. Thank you!
