How does the BQSR step not create bias in SNP detection?
I am using the GATK Best Practices to call variants in my RNA-seq data. So far, I have completed all of the steps up to base recalibration (I skipped the optional indel step). I have been doing a lot of reading on the forum to try to understand the BQSR step. I do not have a set of known variants, so I will need to use the bootstrapping method you described in order to complete BQSR. I understand how this process works mechanically: call SNPs, use the variants that pass filtering as the “known variants” input VCF, and repeat until convergence.

However, I am having trouble understanding how I am not creating a huge amount of bias. From what I understood in the BQSR documentation, the SNPs in the known variants file will be masked (skipped over?), while all mismatches that were not found in the known variants file will then be analyzed further (machine learning?) and given a new quality score. When I try to follow this through, it seems like I'm only recalibrating SNPs that were not in my known variants file, yet my known variants file is full of SNPs that were detected without any recalibration. Furthermore, if I were using a set of known SNPs from dbSNP, it seems like I would be biasing myself even more, making it more likely to call a SNP from a population related to those dbSNP entries.

I don't have a strong background in stats, so I know I must be missing or misunderstanding something important. I think it may have something to do with a forum post of yours about the machine learning looking for systematic errors, but it still seems like I am introducing a lot of bias. I hope this makes sense and wasn't too confusing! Any help in understanding how this process works without creating bias is greatly appreciated. Thank you
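
For reference, this is roughly how I picture the bootstrapping loop, written as a small Python sketch. The functions `call_variants`, `filter_variants`, and `recalibrate` are hypothetical placeholders (not GATK tool or API names) standing in for the actual variant calling, filtering, and BQSR commands. The stopping rule (the set of known sites no longer changes between rounds) and the `max_rounds` cap are my own assumptions about what "convergence" means here. I am also assuming that each round recalibrates the original alignments rather than the previous round's output.

```python
from typing import Callable, FrozenSet, Tuple

# A variant site, identified only by (chromosome, 1-based position) for this sketch.
Site = Tuple[str, int]


def bootstrap_known_sites(
    call_variants: Callable[[str], FrozenSet[Site]],
    filter_variants: Callable[[FrozenSet[Site]], FrozenSet[Site]],
    recalibrate: Callable[[str, FrozenSet[Site]], str],
    original_bam: str,
    max_rounds: int = 5,
) -> Tuple[FrozenSet[Site], int]:
    """Repeat call -> filter -> recalibrate until the known-sites set is stable.

    All three callables are hypothetical stand-ins for real pipeline steps.
    Returns the final known-sites set and the number of recalibration rounds run.
    """
    # Round 0: call on the unrecalibrated data and keep only the confident calls.
    known = filter_variants(call_variants(original_bam))

    for round_number in range(1, max_rounds + 1):
        # Assumption: recalibrate the ORIGINAL alignments each time,
        # masking whatever is currently in the known-sites set.
        recalibrated_bam = recalibrate(original_bam, known)

        # Call and filter again on the recalibrated data.
        new_known = filter_variants(call_variants(recalibrated_bam))

        # Assumed convergence test: the filtered call set did not change.
        if new_known == known:
            return known, round_number
        known = new_known

    # Did not stabilize within max_rounds; return the latest set anyway.
    return known, max_rounds
```

If this matches the intended procedure, my question is really about the `recalibrate` step: what happens inside it to the sites in `known` versus everything else.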