How to do BQSR on WES or other targeted sequencing?

blueskypy Member ✭✭
edited February 28 in Ask the GATK team

I have this question because of the following statement from this article:

In addition, there are some processing steps, such as BQSR, that should be restricted to the capture targets in order to eliminate off-target sequencing data, which is uninformative and is a source of noise.

My understanding is that this means the bias in base quality scores differs between on- and off-target bases, i.e. the recalibration models should differ between the two. If that's correct, it seems to me there are two proper ways of doing BQSR in WES, neither of which is in the Broad Official WES pipeline.
1. Keep variants only in the targets, i.e. apply -L targets.bed -ip 100 at both steps of BQSR.
2. Keep variants in both on- and off-target regions:
2.1. Apply -L targets.bed at both steps of BQSR.
2.2. Apply -XL targets.bed at both steps of BQSR.
2.3. Merge the two BAM files from 2.1 and 2.2.
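
The two options above can be sketched as command lines. This is a minimal sketch rather than a tested pipeline: it assumes GATK4 tool names (BaseRecalibrator, ApplyBQSR, MergeSamFiles), and the file names (sample.bam, ref.fasta, dbsnp.vcf.gz) and the helper bqsr_commands are hypothetical. The script only prints the commands; it does not run GATK.

```python
import shlex


def bqsr_commands(in_bam, out_bam, table, interval_args,
                  ref="ref.fasta", known_sites="dbsnp.vcf.gz"):
    """Build the two BQSR steps (train, then apply) as argv lists.

    The same interval arguments are passed to both steps.
    File names are placeholders, not values from the thread.
    """
    train = ["gatk", "BaseRecalibrator", "-R", ref, "-I", in_bam,
             "--known-sites", known_sites, *interval_args, "-O", table]
    apply_cmd = ["gatk", "ApplyBQSR", "-R", ref, "-I", in_bam,
                 "--bqsr-recal-file", table, *interval_args, "-O", out_bam]
    return [train, apply_cmd]


# Option 1: padded targets only, at both steps.
option1 = bqsr_commands("sample.bam", "sample.recal.bam", "recal.table",
                        ["-L", "targets.bed", "-ip", "100"])

# Option 2: separate models for on-target (-L) and off-target (-XL) bases,
# followed by a merge of the two recalibrated BAMs.
option2 = (
    bqsr_commands("sample.bam", "on.recal.bam", "on.table",
                  ["-L", "targets.bed"])
    + bqsr_commands("sample.bam", "off.recal.bam", "off.table",
                    ["-XL", "targets.bed"])
    + [["gatk", "MergeSamFiles", "-I", "on.recal.bam",
        "-I", "off.recal.bam", "-O", "sample.recal.bam"]]
)

for cmd in option1 + option2:
    print(shlex.join(cmd))
```

Building the commands as lists keeps the interval arguments identical between the training and application steps, which is the point of both options.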

Just wondering if anyone can comment.


  • blueskypy Member ✭✭

    Just a follow-up question: if I use option 2, should I do the same steps for VQSR?

  • bshifaw Member, Broadie, Moderator admin

    Hi @blueskypy,

    I'll need to refer to the dev team but will get back to you with some answers.

  • blueskypy Member ✭✭

    Thanks @bshifaw! Looking forward to the answers!

  • mshand Member, Broadie, Dev ✭✭

    Hi @blueskypy,

    If you only care about variants that are over the targets, then I would recommend running BQSR, HaplotypeCaller, and VQSR only over the targets (so option 1). That will simplify your pipeline while keeping things consistent.
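
One way to keep the steps consistent, as suggested above, is to define the interval arguments once and append them to every command. This is a minimal sketch, assuming GATK4 tool names; the file names, the 100 bp padding (taken from option 1 in the question), and the helper restrict are illustrative only. VariantRecalibrator and ApplyVQSR are left out because their resource and annotation arguments depend on the callset; the assumption here is that they would be restricted with the same -L arguments.

```python
import shlex

# Shared interval arguments, defined once so every step uses the same targets.
INTERVALS = ["-L", "targets.bed", "-ip", "100"]


def restrict(cmd, intervals=INTERVALS):
    """Return a copy of a GATK command (argv list) with the intervals appended."""
    return [*cmd, *intervals]


steps = [
    restrict(["gatk", "BaseRecalibrator", "-R", "ref.fasta", "-I", "sample.bam",
              "--known-sites", "dbsnp.vcf.gz", "-O", "recal.table"]),
    restrict(["gatk", "ApplyBQSR", "-R", "ref.fasta", "-I", "sample.bam",
              "--bqsr-recal-file", "recal.table", "-O", "sample.recal.bam"]),
    restrict(["gatk", "HaplotypeCaller", "-R", "ref.fasta",
              "-I", "sample.recal.bam", "-O", "sample.vcf.gz"]),
]

# Print the commands for inspection; nothing is executed.
for cmd in steps:
    print(shlex.join(cmd))
```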

    If you are specifically trying to look at variants that are off target, you could use the pipeline you're suggesting of training two separate models (option 2) for on targets and off targets. However, my intuition is that in practice this won't dramatically change your sensitivity or precision, although someone would have to try both ways to say definitively if this is worth it or not. This will also depend on your data and if you expect your off target base qualities to behave differently than on target for some reason.

    If you do decide to go with option 2 because you are looking for variants that are off target, I would still recommend running VQSR in one step without separating into off target and on target calls. VQSR needs lots of data to train the model and having more low quality calls that are off target might actually help your performance in your final callset.

  • blueskypy Member ✭✭

    Hi @mshand, thanks for the comment! There are actually two other options:
    3. Set -L targets.bed -ip 100 only at the first step of BQSR, so that the recalibration model trained on on-target bases is used to recalibrate both on- and off-target bases.
    4. No -L in either step. Is this the option used in the Broad Official WES pipeline? If so, it seems inconsistent with my quote from the article above.
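
Options 3 and 4 differ from option 1 only in which BQSR step receives the interval arguments. This is a minimal sketch, assuming GATK4 tool names (BaseRecalibrator for the first step, ApplyBQSR for the second) and hypothetical file names; the OPTIONS table and the build helper are illustrative, and the commands are printed rather than executed.

```python
import shlex

TARGETS = ["-L", "targets.bed", "-ip", "100"]

# Interval arguments per BQSR step: (training step, application step).
OPTIONS = {
    "3": (TARGETS, []),  # train on padded targets, recalibrate all reads
    "4": ([], []),       # no interval restriction at either step
}


def build(option):
    """Return [train, apply] argv lists for the given option number."""
    train_args, apply_args = OPTIONS[option]
    train = ["gatk", "BaseRecalibrator", "-R", "ref.fasta", "-I", "sample.bam",
             "--known-sites", "dbsnp.vcf.gz", *train_args, "-O", "recal.table"]
    apply_cmd = ["gatk", "ApplyBQSR", "-R", "ref.fasta", "-I", "sample.bam",
                 "--bqsr-recal-file", "recal.table", *apply_args,
                 "-O", "sample.recal.bam"]
    return [train, apply_cmd]


for option in OPTIONS:
    for cmd in build(option):
        print(f"option {option}:", shlex.join(cmd))
```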

    Yes, I want to keep both the on- and off-target bases in order to check variants in both regions. I think the key question is whether the on- and off-target bases need different recalibration models. If not, I can just go with option 4.

  • mshand Member, Broadie, Dev ✭✭

    You're correct that in the Broad Official WES pipeline we train and apply the BQSR model over the whole genome. I believe this is because for the data we typically process here, we haven't noticed a difference in base qualities between on and off target. There is some current exploration in limiting where we train BQSR for whole genomes (to remove regions that we know are problematic such as the centromeres) and so far the results have shown that it doesn't make a difference either way. My guess is that the same would be true for the exome, that it won't make a big difference either way. But again, I can't promise that without actually running both ways and comparing.

  • blueskypy Member ✭✭

    Great to know! Thanks, @mshand!
