I've just run the BaseRecalibrator on some whole genome sequences, and while scanning through the recalibration file, I noticed that some of the bases at the beginnings and ends of reads were getting very high recalibration values:
SxaQSEQsXAP010_lane_1 6 -99 Cycle M 7.5248 416048 73563
SxaQSEQsXAP010_lane_1 6 99 Cycle M 6.7402 271864 57587
SxaQSEQsXAP010_lane_1 6 -100 Cycle M 30.1585 519622 500
SxaQSEQsXAP010_lane_1 6 100 Cycle M 30.7455 408415 343
SxaQSEQsXAP010_lane_1 7 1 Cycle M 37.0476 55736 10
SxaQSEQsXAP010_lane_1 7 2 Cycle M 9.6561 55347 5990
...
SxaQSEQsXAP010_lane_1 7 -99 Cycle M 9.3230 14040721 1640938
SxaQSEQsXAP010_lane_1 7 99 Cycle M 9.0272 10199039 1275971
SxaQSEQsXAP010_lane_1 7 -100 Cycle M 33.1557 23210317 11222
SxaQSEQsXAP010_lane_1 7 100 Cycle M 33.9099 21072616 8564
SxaQSEQsXAP010_lane_1 8 -6 Cycle M 7.2585 42164 7926
...
SxaQSEQsXAP010_lane_1 21 -98 Cycle M 22.7383 839160 4466
SxaQSEQsXAP010_lane_1 21 98 Cycle M 22.5192 716787 4012
SxaQSEQsXAP010_lane_1 21 -99 Cycle M 39.9141 872572 88
SxaQSEQsXAP010_lane_1 21 99 Cycle M 40.9464 696355 55
SxaQSEQsXAP010_lane_1 21 -100 Cycle M 38.9586 999226 126
SxaQSEQsXAP010_lane_1 21 100 Cycle M 39.2492 799184 94
SxaQSEQsXAP010_lane_1 22 -1 Cycle M 37.2879 69618 12
SxaQSEQsXAP010_lane_1 22 1 Cycle M 36.5709 108966 23
SxaQSEQsXAP010_lane_1 22 -2 Cycle M 37.7221 35509 5
SxaQSEQsXAP010_lane_1 22 2 Cycle M 37.9585 99992 15
SxaQSEQsXAP010_lane_1 22 -3 Cycle M 21.2202 62377 470
SxaQSEQsXAP010_lane_1 22 3 Cycle M 23.3286 118578 550
A possible explanation is that the aligner (novoalign) is clipping any bases which mismatch, so very few mismatches are observed at the beginnings and ends of reads. That would mean few errors are counted at those cycles, and the empirically measured quality comes out high.
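As a sanity check, the quality column in the table above seems to be reproducible from the last two columns. The sketch below assumes those columns are observation and error counts, and that the error rate is smoothed as (errors + 1) / (observations + 2). Both are my reading of the numbers, not something taken from the GATK source, and the function name is only for illustration.

```python
import math

def empirical_quality(observations, errors):
    """Phred-scaled empirical quality, assuming +1/+2 smoothing of the error rate."""
    error_rate = (errors + 1) / (observations + 2)
    return -10 * math.log10(error_rate)

# Rows copied from the table above: (reported Q, cycle, observations, errors)
rows = [
    (6, -100, 519622, 500),
    (7, 1, 55736, 10),
    (7, 2, 55347, 5990),
]

for reported_q, cycle, obs, err in rows:
    print(reported_q, cycle, round(empirical_quality(obs, err), 4))
```

The results come out close to the 30.1585, 37.0476 and 9.6561 listed in the table. Note that the Q7, cycle 1 value rests on only 10 counted errors, which is what one would expect if mismatching bases at that cycle were clipped before they could be counted.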
However, even if this is correct, I'm wondering if I should trust the recalibration: A base which was originally marked with a quality of 6 or 7 suddenly has the possibility of getting a big boost (modulo any other covariates).
Do you have any thoughts, suggestions, or other possible explanations?
Thanks,
Kevin
Comments
Hi Kevin,
Clipping off mismatching bases on the edges of reads would create a bias like the one you see here in the machine cycle covariate. Before we can decide on the magnitude of the effect, however, it would be good to create the before and after recalibration accuracy plots (you can follow the steps outlined in this thread: http://gatkforums.broadinstitute.org/discussion/1539/baserecalibrator-plots#latest).
As you say, it isn't so obvious that those Q6 and Q7 bases are actually being boosted up because the recalibration depends on the correction factor from both the sequencing context and the original quality score itself.
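To make that concrete, here is a rough sketch of how several correction factors could combine into a final score. It assumes a simplified additive model in which each covariate contributes a shift to the reported quality. The function name, its parameters and the example shifts are hypothetical, and this is not the actual GATK implementation. The `preserve_below` parameter only mimics the idea of leaving very low quality bases untouched.

```python
def recalibrated_quality(reported_q, delta_global, delta_reported,
                         covariate_deltas, preserve_below=6):
    """Simplified additive recalibration (an assumption, not GATK's code).

    reported_q       -- quality score assigned by the sequencer
    delta_global     -- hypothetical shift for the whole read group
    delta_reported   -- hypothetical shift for this reported quality bin
    covariate_deltas -- hypothetical shifts for cycle, context, etc.
    preserve_below   -- scores below this value are returned unchanged
    """
    if reported_q < preserve_below:
        return reported_q
    return reported_q + delta_global + delta_reported + sum(covariate_deltas)

# Hypothetical shifts: a large cycle boost is partly offset by the others.
print(recalibrated_quality(7, -1.0, 0.5, [2.0, -0.5]))
# With a higher threshold, the Q7 base is left alone.
print(recalibrated_quality(7, -1.0, 0.5, [2.0, -0.5], preserve_below=10))
```

In this toy model a large shift from one covariate does not translate directly into the final score, because the other shifts are summed in as well.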
Finally, if you are specifically worried about those very low qual bases you can set the --preserve_qscores_less_than argument to something like 10 (the default value is Q6). This will effectively ignore those bases and leave them alone at their low values.
I hope that helps,
Thanks Ryan. It might take me a little while to generate those plots, as the first set of plots didn't get generated (as with http://gatkforums.broadinstitute.org/discussion/1297/no-plots-generated-by-the-baserecalibrator-walker), and the file is split up. I might just run it on one of the larger chunks.