How does BaseRecalibrator deal with genuine biases in base quality?
alexey_larionov
UK, Member
Base quality may be genuinely associated with some covariates. For instance, average base quality may decrease at the end of the run; or average base quality may be genuinely different between the read groups that come from different runs. Will base quality recalibrator preserve the true quality differences in such cases?
Best Answer

Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie
Ah, if what you want is the detailed implementation, then I recommend you read the source code itself, which is available on github. We have not published the exact algorithm details and we can't spare the time to write that up now.
Answers
That's exactly what the recalibrator does: identify real covariation and adjust the confidence scores accordingly. See the method documentation for more details.
Dear Geraldine,
Thank you for the prompt reply. I had read the documentation before asking the question. However, I still see room for alternative interpretations at some points. I wonder whether you could clarify for me what exactly the recalibrator would do in the simple example below:
I merge both lanes into one BAM (using two read group IDs) and run the recalibrator. My understanding is that the recalibrator will process the data from each read group separately.
Is this correct?
If yes, let us assume that the new average scores are 34 and 39 respectively.
What exactly will recalibrator do with quality scores in this example?
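For concreteness, the bookkeeping I have in mind could be sketched like this (a toy illustration with invented names, not GATK code): bases are binned by their read group before any statistics are computed, so each lane is handled separately.

```python
from collections import defaultdict

def bin_by_read_group(reads):
    """Group base qualities by read group (RG tag).

    reads: iterable of (rg_id, base_qualities) pairs -- a stand-in
    for reading records out of a merged BAM.
    """
    bins = defaultdict(list)
    for rg_id, quals in reads:
        bins[rg_id].extend(quals)
    return bins

def average_quality(bins):
    """Mean reported quality per read group (illustrative only)."""
    return {rg: sum(q) / len(q) for rg, q in bins.items()}
```

So even though the two lanes live in one file, the statistics for each read group never mix.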
Thanks again for your help with this question.
Hi Alexey,
You are correct in assuming that the data from the read groups will be processed separately internally.
However, I think you misunderstand what the program does. It is not concerned with calculating average base quality; that is essentially irrelevant. What the program does is assign new scores individually to each base of each read, based on the patterns of covariation that it identified during the initial model-building. So some bases may receive a higher score while others receive a lower score, or all may be lowered or increased, depending on what the model finds.
Let's say the model finds that every 10th base of each read is more likely to have an error. In that case, the program will lower the score of each 10th base of each read by, let's say, 3 points. But the model also finds that every base read after a TTT sequence is more likely to be correct, so it will increase the score of every such base by maybe 2 points. It's possible that there will be some cases of a 10th base being preceded by TTT; those bases will get -3 + 2 = -1 points, a net decrease of 1. Extend this logic to all the covariates that the program looks at, and you get the total effect of the processing.
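The example above can be sketched in a few lines of code (a minimal illustration of additive per-covariate deltas with the invented numbers from the example, not GATK's actual implementation):

```python
def recalibrate(read_seq, quals):
    """Adjust each base quality by summing hypothetical covariate deltas."""
    new_quals = []
    for i, q in enumerate(quals):
        delta = 0
        # Hypothetical cycle covariate: every 10th base (1-based) is
        # more error-prone, so lower its score by 3.
        if (i + 1) % 10 == 0:
            delta -= 3
        # Hypothetical context covariate: a base preceded by "TTT" is
        # more likely correct, so raise its score by 2.
        if read_seq[max(0, i - 3):i] == "TTT":
            delta += 2
        # A 10th base preceded by TTT therefore gets -3 + 2 = -1 net.
        new_quals.append(max(0, q + delta))
    return new_quals
```

The key point is that each base's new score depends on that base's own covariate values, not on any per-group average.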
Does that clarify things?
Thanks again for your reply. It is getting closer to the point that I still wish to clarify.
I perfectly understand that the program builds a model based on a pattern of covariates and uses the model to assign new scores to each base according to the relevant context, etc.; this is well popularised in numerous places on the Broad website and elsewhere.
What I try to clarify is how exactly the program builds the model and how exactly it corrects the quality scores. Is there a published reference about details of the statistical model implemented in this tool?
Actually, going through a simple example may be even more helpful. Of course, we may use the example of “every 10th base of each read”. However, I think that a single covariate represented by two groups of reads is the simplest possible case. I suppose that in this case the “model” may be reduced to 2 steps:
1) Count error rates (mismatches with the reference genome) in both groups separately, excluding the known variant sites (and ignoring the existing quality scores?)
2) Look at the difference in error rates between the groups
Let’s consider a case with empirical (Phred-scaled) quality scores of 35 and 38 respectively. In your terms, this would mean that the first group had a higher likelihood of errors than the second group, and the difference can be quantified as 3 points (= 38 - 35). Is this how your model works?
If yes, then my next question will be how exactly this 3-point difference is applied to correct the base scores (assuming there are no other covariates to consider).
Again, I believe that going through this simple example may be a very good way to illustrate what the tool actually does. Of course, a paper describing the statistical model may also help.
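To pin down what I am asking, the two steps could be written out as follows (a hypothetical sketch of my reading of the model, with invented function names and numbers; I do not know whether this matches the actual implementation):

```python
import math

def empirical_phred(mismatches, total_bases):
    """Phred-scale the observed error rate: Q = -10 * log10(errors / bases)."""
    error_rate = mismatches / total_bases
    return -10 * math.log10(error_rate)

def apply_delta(reported_quals, reported_avg, empirical_q):
    """Shift every base quality in a group by (empirical - reported average)."""
    delta = empirical_q - reported_avg
    return [round(q + delta) for q in reported_quals]
```

For example, 1 mismatch in 1000 bases would give an empirical quality of 30, and a group whose reported average was 39 but whose empirical quality is 36 would have every base shifted down by 3. Is this the kind of correction the tool applies, or is it something more elaborate?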
Thank you for your patience.