How does the BQSR step not create bias in SNP detection?

Hello,
I am using the GATK best practices to call variants in my RNA-seq data. So far, I have completed all of the steps up to the base recalibration (I skipped the optional indel step). I have been doing a lot of reading on the forum to try to understand the BQSR step.

I do not have a set of known variants, so I will need to do the bootstrapping method you described in order to complete the BQSR steps. I understand how this process works, as in how to do the SNP calling and then use the passed reads as the “known variants” input vcf and repeat to convergence. However, I am having trouble understanding how I am not creating a huge amount of bias.

From what I understood in the BQSR documentation, the SNPs in the known variants file will be masked (skipped over?), while all SNPs that mismatch but were not found in the known variants file will then be further analyzed (machine learning?) and given a new quality score. When trying to understand this process, it seems like I’m just recalibrating SNPs that were not in my known variants file, but my known variants file is full of SNPs that were detected without recalibration lol. Furthermore, if I were using a set of dbSNPs, it seems like I would be biasing myself even more, and making it more likely to call a SNP of a population related to those dbSNPs.

I don’t have a strong background in stats, so I know I must be missing something, or misunderstanding something important! I think it may have something to do with something on one of your forums about the machine learning looking for systematic errors, but it still seems like I am putting in a lot of bias. I hope this makes sense and wasn’t too confusing! Any help to make me better understand how this process works without creating bias is greatly appreciated!!! Thank you :)
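
For anyone following the question, here is a toy sketch of what "masking" known sites could mean for the error model. It is not GATK's implementation. It assumes a single covariate (the reported quality score), a hypothetical input of `(position, reported_q, is_mismatch)` tuples, and a simple `(errors + 1) / (observations + 2)` smoothing. The real BaseRecalibrator bins on more covariates, such as read group, machine cycle, and sequence context. The point it tries to show: known sites are left out of the error counts, and the resulting correction is then applied to every base, including bases at known sites.

```python
import math
from collections import defaultdict

def phred(p):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(p)

def build_recal_table(bases, known_sites):
    """Tally mismatches per reported quality, skipping known variant sites.

    bases: iterable of (position, reported_q, is_mismatch) tuples (hypothetical format).
    known_sites: set of positions to mask.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [observations, mismatches]
    for pos, q, mismatch in bases:
        if pos in known_sites:
            continue  # masked: counted neither as an observation nor as an error
        counts[q][0] += 1
        counts[q][1] += int(mismatch)
    # Smoothed empirical quality per bin (the +1/+2 smoothing is an assumption here).
    return {q: phred((err + 1) / (obs + 2)) for q, (obs, err) in counts.items()}

def recalibrate(bases, table):
    """Apply the per-bin correction to every base, masked or not."""
    return [(pos, table.get(q, q)) for pos, q, _ in bases]

# Toy data: 998 bases reported as Q30 at ordinary positions, 9 of them mismatches
# (sequencing errors), plus 20 bases over a real variant at position 5000,
# all of which mismatch the reference.
bases = [(i, 30, i < 9) for i in range(998)]
bases += [(5000, 30, True)] * 20
known_sites = {5000}

masked = build_recal_table(bases, known_sites)   # real variant excluded from the tally
unmasked = build_recal_table(bases, set())       # real variant counted as error
recalibrated = recalibrate(bases, masked)
```

In this toy data the masked table maps reported Q30 to an empirical Q20, while the unmasked table maps it to roughly Q15, because the true variant is miscounted as sequencing error. Under these assumptions, a missing or incomplete known-sites file pushes all qualities in the affected bins down a little. It does not single out individual SNPs for different treatment.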

Best Answer

Answers

  • lfall (Member)

    Thank you very much! This was very helpful :)
