Hi. I am sequencing the genomes of a panel of bacterial strains (Mycobacterium smegmatis) and aligning them to a reference strain. Some of my strains are very similar to the reference (an estimated 100 variants across the genome) and some are more diverged (an estimated 10,000 variants across the genome). I would like to use GATK to identify the SNPs and indels present in these strains.
My question is about generating a list of “known” SNPs for the BaseRecalibrator tool. I have generated an initial list of high-confidence SNPs from the low-variance strains (about 100 variants each) and will feed this as a VCF file to the base quality score recalibrator. This should work well to refine the base quality scores for the low-variance strains.
But what about the high-variance strains? My concern is that if I use the same VCF file (derived from the low-variance strains) to recalibrate sequencing runs from high-variance strains, I might introduce error, because every real variant in a high-variance strain that is missing from the provided VCF file will be treated as a sequencing error.
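To put rough numbers on this concern, here is a back-of-the-envelope sketch in Python. It is only an illustration under assumptions I have not verified for my data: a genome of roughly 7 Mb, a true per-base sequencing error rate of 0.001 (about Q30), and every read base covering an unmasked variant site counting as a mismatch (the strains are haploid). The function names are my own, not part of GATK, and the real recalibration model bins by covariates rather than using one genome-wide rate.

```python
import math

# Illustrative assumptions, not measurements from my data.
GENOME_SIZE = 7_000_000   # approximate M. smegmatis genome length in bp
SEQ_ERROR_RATE = 0.001    # assumed true per-base sequencing error rate (~Q30)


def apparent_mismatch_rate(unmasked_variants,
                           genome_size=GENOME_SIZE,
                           seq_error_rate=SEQ_ERROR_RATE):
    """Genome-wide mismatch rate when real variant sites are not masked.

    Bases at unmasked variant sites are assumed to always mismatch the
    reference; bases elsewhere mismatch at the sequencing error rate.
    Uniform coverage is assumed, so read depth cancels out.
    """
    variant_fraction = unmasked_variants / genome_size
    return variant_fraction + (1.0 - variant_fraction) * seq_error_rate


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


for n_variants in (100, 10_000):
    rate = apparent_mismatch_rate(n_variants)
    print(f"{n_variants:>6} unmasked variants: "
          f"apparent error rate {rate:.6f}, about Q{phred(rate):.1f}")
```

Under these assumptions, 100 unmasked variants barely move the apparent error rate, while 10,000 unmasked variants more than double it, which is the kind of distortion I am worried about.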
Does this mean that I will have to generate a provisional “known” SNPs VCF file for every strain?
Thank you very much for the advice!