Correct understanding of BQSR
I would like to get some clarification on a few issues related to BQSR (base quality score recalibration).
We work on bacterial genomes, with approximately 8000 in our collection and about 120 new ones added every week.
We asked ourselves whether BQSR is beneficial or harmful for our data. As we do not yet have a large, high-confidence SNP database, we tried using a small SNP list, for which each isolate naturally has only a small number of matches. Does this make any sense? We see an improvement in the plots, but we also fear that real SNP positions not covered by the list will have their quality values decreased and may not be called later on. Would this be the case? Wouldn't this degrade sensitivity? What happens to such positions? Would the ability to detect novel SNPs be impaired? We need to be able to detect novel SNPs at all genome positions.
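To make this concern concrete, here is a minimal sketch of the effect as we understand it. It assumes a simplified model, not GATK's actual implementation: every mismatch at a site missing from the known-sites list is counted as a sequencing error, and the empirical quality is computed per covariate bin rather than per genome position. The function name and all numbers are hypothetical.

```python
import math

def empirical_quality(mismatches, observations):
    """Phred-scaled error rate for one covariate bin.

    The +1/+2 smoothing is a simplification chosen for this sketch;
    it is not claimed to match GATK's estimator.
    """
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)

# Hypothetical counts for a single covariate bin.
observations = 1_000_000      # bases observed in this bin
sequencing_errors = 1_000     # true sequencing errors
unmasked_snp_bases = 500      # real variant bases missing from the SNP list

# All real variants masked by the known-sites list.
q_masked = empirical_quality(sequencing_errors, observations)

# Real variants not in the list are counted as errors too.
q_unmasked = empirical_quality(sequencing_errors + unmasked_snp_bases,
                               observations)

print(f"all variants masked:    Q = {q_masked:.2f}")
print(f"some variants unmasked: Q = {q_unmasked:.2f}")
```

Under these assumptions the unmasked variants lower the quality of every base in the bin a little, rather than lowering only the bases at the real SNP positions. Whether this matches what BQSR really does is part of our question.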
In this case, would it make sense to bootstrap a SNP list from a subset of these 8000 genomes and use it as the recalibration list? This would apparently also miss possible new SNP positions in newly sequenced isolates. Does this mean that we would have to bootstrap a new SNP list for recalibration for every subset to be analyzed?
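The bootstrapping we have in mind could be sketched roughly as follows. This is only an illustration with made-up data: the function name, the isolate names, the contig name and the `min_isolates` threshold are all hypothetical, and the per-isolate calls are assumed to come from an initial round of variant calling without recalibration.

```python
from collections import Counter

def bootstrap_known_sites(calls_per_isolate, min_isolates=2):
    """Keep sites that were called in at least `min_isolates` isolates.

    calls_per_isolate maps an isolate name to a list of (contig, position)
    tuples from an initial, unrecalibrated variant-calling round.
    """
    counts = Counter()
    for sites in calls_per_isolate.values():
        counts.update(set(sites))  # count each site once per isolate
    return sorted(site for site, n in counts.items() if n >= min_isolates)

# Hypothetical first-pass calls for three isolates.
calls = {
    "isolate_A": [("chr", 100), ("chr", 250)],
    "isolate_B": [("chr", 100), ("chr", 900)],
    "isolate_C": [("chr", 100), ("chr", 250)],
}

known_sites = bootstrap_known_sites(calls)
print(known_sites)
```

In this toy example the site at position 900, seen in only one isolate, does not make it into the list, which is exactly the kind of isolate-specific or novel position we are worried about.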
As we are also interested in low-frequency SNPs, recalibration seems even less appropriate. Will positions that are not covered by the SNP list and have only a few mismatching reads, but represent a real subpopulation, have their base qualities lowered, making identification harder?
In conclusion, we think we would be far better off skipping BQSR and going directly to variant calling.
Does this make sense?