Do you guys have any plans to get BD and BI tags added to the SAM spec?
We are talking to the relevant parties about this. Since the use of these values is still a research topic within the group, we'll hold off until we have established how important they are.
More importantly, we'd like the sequencing instruments to report those values so that we can recalibrate them. Currently, only PacBio does so.
As Carneiro mentioned, we haven't really pushed for changes to the BAM spec because we haven't been excited by the prospect of adding two more strings of quality scores to every read. The improvement in indel accuracy was overshadowed by the ballooning of the BAM file. We've been developing a new indel error model that supersedes the one in BQSR and achieves even better accuracy while using far fewer parameters. We are hoping to have it in place by the release of version 2.6.
For the curious, the new model uses the Cycle covariate from BQSR and a new covariate that describes tandem repeats. The win is that we no longer use the full space of preceding base context, which results in a massive reduction in the parameter space.
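To make the tandem-repeat idea a bit more concrete, here is a minimal Python sketch of how a repeat covariate could be computed for one base of a read. This is only an illustration under my own assumptions, not the model described above: the function name tandem_repeat_at, the max_unit cutoff, and the rule of preferring the unit with the most copies are all hypothetical.

```python
def tandem_repeat_at(seq, pos, max_unit=4):
    """Return (unit, copies) for the tandem repeat ending at seq[pos].

    Hypothetical definition: try every unit length up to max_unit and
    keep the unit with the most back-to-back copies ending at pos
    (ties go to the shorter unit).
    """
    best_unit, best_copies = seq[pos], 1
    for k in range(1, max_unit + 1):
        start = pos - k + 1
        if start < 0:
            break  # not enough bases to the left for a unit this long
        unit = seq[start:pos + 1]
        copies = 1
        # Walk backwards one unit at a time while the previous k bases match.
        while start - k >= 0 and seq[start - k:start] == unit:
            copies += 1
            start -= k
        if copies > best_copies:
            best_unit, best_copies = unit, copies
    return best_unit, best_copies


if __name__ == "__main__":
    print(tandem_repeat_at("ACGTTTTTG", 7))  # homopolymer run of T
    print(tandem_repeat_at("GACACACT", 6))   # dinucleotide repeat of AC
```

A covariate like this takes a small number of (unit, copies) values, whereas a full preceding context of k bases has 4**k possible values, which is where a reduction in parameter space would come from.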
I hope that is helpful. Let me know if you have any other questions.
The reason I ask is that I'm currently researching the purity filtering and recalibration we do here at the Sanger. It uses the tagged, spiked-in PhiX to recalibrate the quality scores, and I was wondering whether it would be worth extending it to produce BD and BI scores, and how hard that would be. I'm also trying to figure out how it interacts with BQSR, and whether doing both is a good or a bad thing. Also, it looks like a new version of the SAM spec might be tagged in response to the abilities of the new BWA algorithm.
@rpoplin is going to get in touch with you to discuss this topic in more detail. We are still treating this as a research topic, and we have other approaches that are being evaluated before we settle on a standard of having three strings of base qualities.
Could you please advise whether, in GATK v3.4, UG and HC still use the BI and BD tags? Do any other downstream applications need BD and BI? If these tags are no longer used downstream, they could be stripped to save some space.
The BD and BI tags are the indel base qualities added by BQSR. They are still used by GATK, so you should not get rid of them.
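If you want to see whether your reads carry these tags and roughly how much they add per read, a small script along these lines can help. It is a minimal sketch that assumes uncompressed SAM text (for example, piped from samtools view) and relies only on the standard SAM layout of 11 mandatory fields followed by TAG:TYPE:VALUE optional fields. The function name indel_quality_tags and the example record are made up for illustration.

```python
def indel_quality_tags(sam_line):
    """Return a dict of the BD/BI tag values found on one SAM record line."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # Optional fields start after the 11 mandatory SAM columns.
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        if tag in ("BD", "BI"):
            tags[tag] = value
    return tags


if __name__ == "__main__":
    # Made-up record, for illustration only.
    record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tBD:Z:NNNN\tBI:Z:MMMM\tNM:i:0"
    found = indel_quality_tags(record)
    print(sorted(found))                         # which of BD/BI are present
    print(sum(len(v) for v in found.values()))   # extra quality characters on this read
```

Each tag holds one character per base, so a read with both tags carries roughly three quality strings instead of one, which is the file-size cost discussed earlier in this thread.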