How does the BQSR step not create bias in SNP detection?

Hello,
I am using the GATK Best Practices to call variants in my RNA-seq data. So far, I have completed all of the steps up to base recalibration (I skipped the optional indel step). I have been reading the forum to try to understand the BQSR step. I do not have a set of known variants, so I will need to use the bootstrapping method you described: call SNPs, use the calls that pass filtering as the "known variants" input VCF, and repeat until convergence. I understand the mechanics of this process, but I am having trouble understanding how it avoids introducing a huge amount of bias.

From what I understood in the BQSR documentation, the SNPs in the known-variants file will be masked (skipped over?), while all mismatching bases that were not found in the known-variants file will be further analyzed (machine learning?) and given new quality scores. It seems like I am recalibrating around SNPs that were not in my known-variants file, yet my known-variants file is full of SNPs that were detected without any recalibration. Furthermore, if I were using dbSNP, it seems like I would bias myself even more, making it more likely to call SNPs from populations represented in dbSNP.

I don't have a strong background in statistics, so I know I must be missing or misunderstanding something important. I think it may relate to something on one of your forums about the machine learning looking for systematic errors, but it still seems like I am introducing a lot of bias. I hope this makes sense and wasn't too confusing! Any help to better understand how this process works without creating bias is greatly appreciated! Thank you
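For concreteness, the bootstrapping loop described above can be sketched as follows. This is only an illustrative sketch: the GATK4 tool names (HaplotypeCaller, VariantFiltration, BaseRecalibrator, ApplyBQSR) are real, but all file names, the filter expression, and the threshold are hypothetical placeholders; the snippet builds the command lines for one round rather than executing them.

```python
# Sketch of one round of the BQSR bootstrapping loop (hypothetical
# file names and filter settings; GATK4 tool names are real).

def bootstrap_round(bam, ref, round_num):
    """Build the GATK command lines for one bootstrap iteration."""
    raw_vcf = f"round{round_num}.raw.vcf.gz"
    filt_vcf = f"round{round_num}.filtered.vcf.gz"
    table = f"round{round_num}.recal.table"
    recal_bam = f"round{round_num}.recal.bam"
    return [
        # 1. Call variants without a known-sites resource.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", raw_vcf],
        # 2. Hard-filter so only confident calls become "known" sites
        #    (the filter expression here is a placeholder example).
        ["gatk", "VariantFiltration", "-R", ref, "-V", raw_vcf,
         "--filter-name", "lowQD", "--filter-expression", "QD < 2.0",
         "-O", filt_vcf],
        # 3. Model the error covariates, masking the filtered call set.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
         "--known-sites", filt_vcf, "-O", table],
        # 4. Apply the recalibration; the output BAM feeds the next round.
        ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
         "--bqsr-recal-file", table, "-O", recal_bam],
    ]

cmds = bootstrap_round("sample.rnaseq.bam", "ref.fasta", 1)
for c in cmds:
    print(" ".join(c))
```

The loop is repeated, feeding each round's recalibrated BAM into the next, until the call set (or the recalibrated qualities) stops changing between rounds.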
Best Answer
-
Sheila Broad Institute admin
@lfall
Hi,

You wrote: "while all SNPs that mismatch but were not found in the known variants file will then be further analyzed (machine learning?) and given a new quality score."
This is not exactly true. The bases that mismatch the reference that are not in the known sites file are treated as "errors". The tool looks for features associated with those "errors". The features may be base context or position or other things (there are ~17 different features the tool looks at). If it consistently sees an error in one of the features, it will lower the base quality of all bases with that feature, not just the "error" bases. For example, if the tool sees an "error" at every position 10 in the reads, all base quality scores in position 10 will be lowered.
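The per-covariate logic described above can be illustrated with a toy sketch (not the actual BQSR implementation, which models many covariates jointly with Bayesian smoothing). Here mismatches are grouped by a single covariate, the read position, and an empirical Phred quality is computed per position; a position with a systematically elevated error rate gets a lower quality for all of its bases. All counts below are synthetic.

```python
import math

# Toy illustration of one BQSR covariate (read position / machine cycle).
# Not the real implementation; it just shows the empirical-quality idea.

def empirical_quality(errors, observations):
    """Phred-scaled empirical quality: Q = -10 * log10(error rate)."""
    rate = (errors + 1) / (observations + 2)  # simple pseudocount smoothing
    return round(-10 * math.log10(rate))

# Synthetic counts: (mismatches-vs-reference, total bases) per read position.
# Position 10 shows a systematic error pattern, like the example above.
counts = {pos: (10, 10_000) for pos in range(1, 21)}   # ~Q30 background
counts[10] = (1_000, 10_000)                            # ~10% error rate

recalibrated = {pos: empirical_quality(e, n) for pos, (e, n) in counts.items()}

print(recalibrated[9])   # background position keeps a high quality (~Q30)
print(recalibrated[10])  # the systematic-error position is lowered (~Q10)
```

Every base at position 10 is downgraded, not just the mismatching ones, which is why the model targets systematic machine error rather than individual variant sites.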
The team has done testing and found that the features the tool uses are the ones most likely to carry errors from sequencers. Keep in mind that sequencing errors occur quite systematically and far more frequently than novel variants. If the novel variant sites have high base qualities and appear in a variety of contexts (not just the error-associated ones we look for), their base qualities will not change significantly.
I hope this helps.
-Sheila
Answers
Thank you very much! This was very helpful.