Correct understanding of BQSR

Hi,

I just wanted to reach clarification on some issues related to BQSR.
We work on bacterial genomes, with approximately 8000 in the collection and about 120 new every week.

We asked ourself if BQSR is beneficial or harmfull for our data. As we have no big confidence SNP database yet, we tried to use a small SNP list for which every single isolate of course only has a small number of matches. Any sense in doing this? We see an improvement in the plots, but also have the fear that real SNP positions not covered by the list will have their quality values decreased and may not be called later on. Would this be the case? Wouldn't this degrade sensitivity? What happens to such positions? Would the ability to detect novell SNPs be impaired? Because detecting novell SNPs at all genome positions would be necessary.

In this case, would it make sense to bootstrap a list from a subset of these 8000 genomes and use this as recalibration list? But apparently this also would miss possible new SNP positions in new sequenced isolates. Does this mean that for every subset to be analyzed we would have to bootstrap a new SNP list for recalibration?

As we are also interested in low frequency SNPs the recalibration seems even more inappropriate. Will positions not covered from the SNP list and with only few missmatching reads, but a real subpopulation, will have their quality lowered and by this make the identification harder?

In conclusion we think we would be far better off with skipping BQSR and go directly to variant calling.

Does this make sense?

Thank you!

Issue · Github
by Sheila

Issue Number
1154
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @chickenmcfu (nice username btw),

    Bootstrapping a list of known sites from a subset of your 8000 genomes is a good idea (much better than just skipping BQSR). It won't mask out all sites, but that's okay. We expect that there will be new unmasked sites in every new isolates, and as long as their numbers are moderately small compared to all the sites that are masked, they should not get penalized since the model should not find any systematic covariation among their properties. This is because recalibration is not based on looking at specific sites, but on detecting patterns of error across large numbers of base calls.

  • SumuduSumudu Sri LankaMember
    edited August 2016

    Hi,

    I'm working on a parasite genome with no known SNP/indel sites. I decided to bootstrap a set of known sites to perform BQSR. Few things to clarify which I couldn't find a proper answer to my questions after going through most of the FAQ and common problems.

    1). Documentation says to do bootstrapping until convergence, mean til you get a same fixed set of variants? How would you recommend to check that you get a fixed set after few iterations? Visualizing using IGV or using a tool like VCFtools etc?

    2). In bootstrapping step, is it ok to use a simple filter like QUAL > some figure? If so, What is the best way to select the QUAL figure to filter variants in my raw data?
    Or use stringent filtering for SNPs & indels as per recommendations in GATK documents?

    3). Suppose we get a reliable known set by bootstrapping, the next step would be to run BaseRecalibrator using those known sets and original data & then to run PrintReads to get a recalibrated BAM file ? Is this correct?

    4). Next using that recalibrated BAM, we can run HaplotypeCaller to get a raw variants set. Then is it fine to do VQSR on those raw variants using the known set we obtained before Or to apply the same hard-filtering step again on those raw variants? What would be the best way?

    5) I saw somewhere in the GATK documents that it is not necessary to run IndelRealignment if you run HaplotypeCaller? And in my case is it fine to do IndelRealignment without any knownSites information since I don't have such ?

    Appreciate any advice on these since I desperately want to solve these issues to move forward.

    Thanks
    Best Regards
    Sumudu

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Sumudu
    Hi Sumudu,

    1) Convergence means that there is no longer any substantial change in the effect of recalibration between iterations, which indicates that you have produced a set of known sites that adequately masks true variation in the data. When you look at the plots produced by AnalyzeCovariates, ideally you're hoping for the "after" dots to line up nicely if recalibration worked well, whereas the before dots may be somewhat scattered.

    Basically, you will compare the plots from AnalyzeCovariates each time you do a round of bootstrapping, and when there is not much change int ehe plots between the iterations, it means you have reached convergence.

    2) We have some basic hard filtering recommendations here. Also have a look at the recommendations here for adapting the hard filters to your dataset.

    3) Yes.

    4) Have a look at Geraldine's response in this thread and in this thread.

    5) Yes, the Indel Realignment step is not necessary. https://software.broadinstitute.org/gatk/blog?id=7847

    -Sheila

  • SumuduSumudu Sri LankaMember

    Hi Sheila,

    Thank you very much. Your answer helped me a lot to get through my issues.
    One more. If I use Samtools Mpileup and HaplotypeCaller to get two sets of variants and select the ones in agreement between the two and select them as known-variants in BQSR instead of bootstrapping, Would this be appropriate?

    Best Regards,
    Sumudu

    Issue · Github
    by Sheila

    Issue Number
    1176
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @Sumudu,

    That kind of approach (taking the consensus of several callers) can work in principle, but be aware that it has some flaws. For example, some variants may be represented differently between callers, and not be detected as the same. Also, if one variant calls one class of variants (for example indels) much better than the other, you may fail to mask out many real variants as a result. So, it can work, but we can't guarantee that it will work well.

  • SumuduSumudu Sri LankaMember

    Thank you Geraldine_VdAuwera for the answer.

Sign In or Register to comment.