How does the base quality recalibrator deal with genuine biases in base quality?

Base quality may be genuinely associated with some covariates. For instance, average base quality may genuinely decrease towards the end of the run, or it may genuinely differ between read groups that come from different runs. Will the base quality recalibrator preserve the true quality differences in such cases?

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    That's exactly what the recalibrator does: identify real covariation and adjust the confidence scores accordingly. See the method documentation for more details.

  • Dear Geraldine,

    Thank you for the prompt reply. I had read the documentation before asking the question. However, I still see room for alternative interpretations at some points. I wonder whether you could clarify what exactly the recalibrator would do in the simple example below:

    • Assume I have a sample, which was sequenced twice (on two different lanes)
    • Assume that the first lane has average base quality 35 and the second has average quality 38 (as assigned by the machine)

    I merge both lanes into one BAM (using two read group IDs) and run the recalibrator. My understanding is that the recalibrator will

    • count errors in both read groups separately (excluding the known variants) and
    • calculate the new average scores from the error rates.

    Is this correct?
    If yes, let us assume that the new average scores are 34 and 39 respectively.
    What exactly will the recalibrator do with the quality scores in this example?
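
    To make it concrete, here is a rough sketch of what I mean by these two steps, written as Python with made-up mismatch counts. It is only my mental model spelled out, not GATK code, and both the function and the numbers are invented for illustration:

    ```python
    import math

    # Illustration of my mental model only, with invented counts;
    # not GATK code and not necessarily what BaseRecalibrator actually does.

    def empirical_quality(mismatches, observations):
        """Phred-scaled quality implied by an observed error rate."""
        error_rate = mismatches / observations
        return -10 * math.log10(error_rate)

    # Hypothetical per-read-group counts, after excluding known variant sites:
    lane1_q = empirical_quality(mismatches=4_000, observations=10_000_000)   # ~34
    lane2_q = empirical_quality(mismatches=1_260, observations=10_000_000)   # ~39

    print(round(lane1_q, 1), round(lane2_q, 1))   # prints roughly 34.0 and 39.0
    ```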

    Thanks again for your help with this question.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Alexey,

    You are correct in assuming that the data from the read groups will be processed separately internally.

    However, I think you misunderstand what the program does. It is not concerned with calculating average base quality -- that is essentially irrelevant. What the program does is to assign new scores individually to each base of each read based on the patterns of covariation that it identified in the initial model-building. So some bases may receive a higher score while some bases receive a lower score, or all may be lowered or increased, depending on what the model finds.

    Let's say the model finds that every 10th base of each read is more likely to have an error. In that case, the program will lower the score of every 10th base of each read by, say, 3 points. Suppose the model also finds that every base read after a TTT sequence is more likely to be correct; then it will increase the score of every such base by, say, 2 points. Some bases will be both a 10th base and preceded by TTT, so those bases will get -3 + 2 = -1 points. Extend this logic to all the covariates that the program looks at, and you get the total effect of the processing.
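
    To show how those two made-up adjustments combine for a single base, here is a toy sketch; the covariates and the point values are purely illustrative and are not the actual model or its output:

    ```python
    # Toy illustration of the additive logic in the example above.
    # The covariates and point values are made up; they are not the actual
    # covariates or deltas the tool would report.

    def toy_recalibrated_quality(reported_q, position_in_read, preceding_context):
        q = reported_q
        if position_in_read % 10 == 0:     # "every 10th base is more error-prone"
            q -= 3
        if preceding_context == "TTT":     # "bases after TTT are more accurate"
            q += 2
        return q

    # A 10th base that is also preceded by TTT: -3 + 2 = -1 versus its reported score.
    print(toy_recalibrated_quality(reported_q=35, position_in_read=10, preceding_context="TTT"))  # 34
    ```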

    Does that clarify things?

  • Thanks again for your reply. It is getting closer to the point that I still wish to clarify :)

    I understand perfectly well that the program builds a model based on patterns of covariates and uses that model to assign new scores to each base according to the relevant context, and so on; this is well publicised in numerous places on the Broad web site and elsewhere.

    What I am trying to clarify is how exactly the program builds the model and how exactly it corrects the quality scores. Is there a published reference describing the details of the statistical model implemented in this tool?

    Actually, going through a simple example may be even more helpful. Of course, we may use the example of “every 10th base of each read”. However, I think that a single covariate represented by two groups of reads is the simplest possible case. I suppose that in this case the “model” may be reduced to 2 steps:

    1) Count the error rates (mismatches against the reference genome) in the two groups separately, excluding the known variants (and ignoring the existing quality scores?)
    2) Compare the error rates between the two groups

    Let’s consider a case with error rates 35 and 38 respectively. In your terms, this could mean that the first group had higher likelihood of errors than the second group, and the difference can be quantified as 3 points (=38 – 35). Is this how your model works?

    If yes, then my next question is how exactly this 3-point difference is applied to correct the base scores (assuming there are no other covariates to consider).
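
    To spell out the kind of answer I am hoping for, here is a toy guess in Python, using my earlier numbers (reported averages 35 and 38, observed 34 and 39). It is purely my own invention; I would like to know whether the correction is something like this per-group shift, whether the 3-point gap between the groups is used in some other way, or whether something else happens entirely:

    ```python
    # A toy guess, not GATK code: shift every base of a read group by the gap
    # between that group's observed (empirical) quality and its reported average.
    # The shifts below use my earlier numbers (reported 35/38, observed 34/39).

    GROUP_SHIFT = {
        "lane1": 34 - 35,   # observed 34 vs reported 35 -> shift each base by -1
        "lane2": 39 - 38,   # observed 39 vs reported 38 -> shift each base by +1
    }

    def guessed_correction(reported_q, read_group):
        return reported_q + GROUP_SHIFT[read_group]

    print(guessed_correction(35, "lane1"))   # 34
    print(guessed_correction(38, "lane2"))   # 39
    ```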

    Again, I believe that going through this simple example may be a very good way to illustrate what the tool actually does. Of course, a paper describing the statistical model may also help.

    Thank you for your patience :)
