How does BaseRecalibrator deal with genuine biases in base quality?
alexey_larionov
UK, Member
Base quality may be genuinely associated with some covariates. For instance, average base quality may decrease at the end of the run; or average base quality may be genuinely different between the read groups that come from different runs. Will base quality recalibrator preserve the true quality differences in such cases?
Best Answer

Geraldine_VdAuwera Cambridge, MA admin
Ah, if what you want is the detailed implementation, then I recommend you read the source code itself, which is available on GitHub. We have not published the exact algorithm details and we can't spare the time to write that up now.
Answers
That's exactly what the recalibrator does: identify real covariation and adjust the confidence scores accordingly. See the method documentation for more details.
Dear Geraldine,
Thank you for the prompt reply. I had read the documentation before asking the question. However, I still see room for alternative interpretations at some points. I wonder whether you could clarify for me what exactly the recalibrator would do in the simple example below:
I merge both lanes into one BAM (using two read group IDs) and run the recalibrator. My understanding is that the recalibrator will
Is this correct?
If yes, let us assume that the new average scores are 34 and 39 respectively.
What exactly will recalibrator do with quality scores in this example?
Thanks again for your help with this question.
Hi Alexey,
You are correct in assuming that the data from the read groups will be processed separately internally.
However, I think you misunderstand what the program does. It is not concerned with calculating average base quality; that is essentially irrelevant. What the program does is assign new scores individually to each base of each read, based on the patterns of covariation that it identified in the initial model-building. So some bases may receive a higher score while others receive a lower score, or all may be lowered or increased, depending on what the model finds.
Let's say the model finds that every 10th base of each read is more likely to have an error. In that case, the program will lower the score of every 10th base of each read by, let's say, 3 points. But the model also finds that every base read after a TTT sequence is more likely to be correct, so it will increase the score of every such base by maybe 2 points. It's possible that there will be some cases of a 10th base being preceded by TTT. Those bases will get -3 + 2 = -1 points. Extend this logic to all the covariates that the program looks at, and you get the total effect of the processing.
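To make the arithmetic in this illustration concrete, here is a minimal Python sketch. It is not GATK's implementation: the function name `adjust_qualities`, its parameters, and the example read are all hypothetical, and the two rules (lower every 10th base by 3, raise every base that follows TTT by 2) are simply the made-up figures from the example above.

```python
def adjust_qualities(bases, quals, every_nth=10, nth_penalty=-3,
                     motif="TTT", motif_bonus=2):
    """Apply the two illustrative adjustments to one read (hypothetical rules)."""
    new_quals = []
    for i, q in enumerate(quals):
        delta = 0
        # Positions are 1-based here: the 10th, 20th, ... bases are penalised.
        if (i + 1) % every_nth == 0:
            delta += nth_penalty
        # A base immediately preceded by the motif gets the bonus.
        if i >= len(motif) and bases[i - len(motif):i] == motif:
            delta += motif_bonus
        new_quals.append(q + delta)
    return new_quals


# Hypothetical 20-base read: the 10th base follows TTT, the 15th base follows
# TTT without being a 10th base, and the 20th base is only a 10th base.
read = "ACGTACTTTGATTTCGTACG"
reported = [30] * len(read)
recalibrated = adjust_qualities(read, reported)
print(recalibrated)
```

In this toy read, the 10th base ends up one point lower (both rules apply), the 15th base two points higher (TTT rule only), and the 20th base three points lower (position rule only); every other base keeps its reported score.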
Does that clarify things?
Thanks again for your reply. It is getting closer to the point that I still wish to clarify.
I perfectly understand that the program builds a model based on a pattern of covariates and uses the model to assign new scores to each base according to the relevant context, and so on; this is well popularised in numerous places on the Broad website and elsewhere.
What I am trying to clarify is how exactly the program builds the model and how exactly it corrects the quality scores. Is there a published reference describing the details of the statistical model implemented in this tool?
Actually, going through a simple example may be even more helpful. Of course, we may use the example of “every 10th base of each read”. However, I think that a single covariate represented by two groups of reads is the simplest possible case. I suppose that in this case the “model” may be reduced to 2 steps:
1) Count error rates (mismatches with reference genome) in both groups separately, excluding the known variants (and ignoring existing quality scores?)
2) Look at the difference in error rates between the groups
Let’s consider a case with error rates of 35 and 38 respectively. In your terms, this could mean that the first group had a higher likelihood of errors than the second group, and the difference can be quantified as 3 points (=38 – 35). Is this how your model works?
If yes, then my next question is how exactly this 3-point difference is applied to correct the base scores (assuming there are no other covariates to consider).
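The two-step counting proposed above can be sketched in Python as follows. This only illustrates the standard Phred relationship Q = -10 * log10(error rate) applied to per-group mismatch counts; it is not taken from the GATK source, and the function name `empirical_quality` and the counts are hypothetical values chosen so that the two groups come out near 35 and 38.

```python
import math


def empirical_quality(mismatches, total_bases):
    """Phred-scaled quality from an observed mismatch rate: Q = -10 * log10(p)."""
    if total_bases <= 0 or mismatches <= 0:
        raise ValueError("need positive counts for this simple sketch")
    return -10.0 * math.log10(mismatches / total_bases)


# Hypothetical counts per read group, with known variant sites already excluded.
counts = {
    "group_1": (3162, 10_000_000),  # (mismatches, bases observed)
    "group_2": (1585, 10_000_000),
}

# Step 1: empirical quality of each group, computed separately.
q = {rg: empirical_quality(m, n) for rg, (m, n) in counts.items()}

# Step 2: difference between the groups, in Phred points.
difference = q["group_2"] - q["group_1"]

for rg, value in q.items():
    print(rg, round(value, 1))
print("difference", round(difference, 1))
```

Whether and how such a difference feeds into the recalibrated scores is exactly the question being asked here, so the sketch stops at computing it.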
Again, I believe that going through this simple example may be a very good way to illustrate what the tool actually does. Of course, a paper describing the statistical model may also help.
Thank you for your patience