Base Recalibration

Hi. I am sequencing the genomes of a panel of bacterial strains (Mycobacterium smegmatis) and aligning them to a reference strain. Some of my strains are very similar to the reference (an estimated 100 variants across the genome) and some are more diverged (an estimated 10,000 variants across the genome). I am interested in using GATK to identify the SNPs and indels present in these strains.

My question is about generating a list of “known” SNPs for the BaseRecalibrator tool. I have generated an initial list of high-confidence SNPs from the low-variance strains (100 variants) and will feed this as a VCF file to the base quality score recalibrator. This should work well to refine the base quality scores for the low-variance strains.

But what about the high-variance strains? My concern is that, if I use the same VCF file (derived from the low-variance strains) to do base recalibration on sequencing runs from high-variance strains, I might introduce error, because every real variant in the high-variance strain that is missing from the provided VCF file will be assumed to be a sequencing error.
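
To put a rough number on this concern, here is a back-of-the-envelope sketch in Python. It is not how the recalibrator actually models errors (it works per covariate bin, not on a genome-wide average). The genome size of about 7 Mb, the true quality of Q30, and the function names are all assumptions made for illustration.

```python
import math

def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(error_rate)

def apparent_quality(true_q, unmasked_variants, genome_size):
    """Approximate the average quality a recalibrator would infer when
    real variants missing from the known-sites file count as errors."""
    true_error = 10 ** (-true_q / 10)
    # Fraction of reference positions carrying a real but unmasked variant.
    # Every read base over such a site mismatches, so read depth cancels out.
    variant_rate = unmasked_variants / genome_size
    # Mismatches = sequencing errors at ordinary sites + all bases at variant sites.
    observed_error = true_error * (1 - variant_rate) + variant_rate
    return phred(observed_error)

GENOME_SIZE = 7_000_000  # assumed approximate M. smegmatis genome size
TRUE_Q = 30              # assumed true base quality

low = apparent_quality(TRUE_Q, 100, GENOME_SIZE)
high = apparent_quality(TRUE_Q, 10_000, GENOME_SIZE)
print(f"100 unmasked variants:    Q{low:.1f}")
print(f"10,000 unmasked variants: Q{high:.1f}")
```

Under these assumptions, 100 unmasked variants barely move the estimate, while 10,000 pull a true Q30 down to roughly Q26, which is the kind of bias I am worried about.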

Does this mean that I will have to generate a provisional “known” SNPs VCF file for every strain?

    Hi @rock,

    You are correct that using the same set of knowns on the high-variance strains may be problematic. Unfortunately we do not have any experience dealing with this, as we only work with human genomes, so the recommendations we can give you are limited. I do think the safest option is to bootstrap a set of knowns for each of the high-variance strains, but feel free to experiment with different groupings.
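
    For reference, bootstrapping usually means: call variants on the unrecalibrated data, keep only the most confident calls as provisional knowns, recalibrate with them, call again, and repeat until the set stops changing. The filtering and convergence steps could be sketched roughly as below. The function names and the QUAL cutoff of 100 are hypothetical choices for illustration, not a GATK recommendation, and the sketch works on plain VCF text lines rather than calling any GATK tool.

```python
def select_confident_snps(vcf_lines, min_qual=100.0):
    """Keep header lines plus biallelic SNP records with QUAL >= min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # pass VCF headers through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        # Single-base REF and a single-base A/C/G/T ALT means a biallelic SNP.
        is_snp = len(ref) == 1 and len(alt) == 1 and alt in "ACGT"
        if is_snp and qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept

def site_keys(vcf_lines):
    """Reduce records to (CHROM, POS, REF, ALT) tuples for comparison."""
    keys = set()
    for line in vcf_lines:
        if not line.startswith("#"):
            f = line.rstrip("\n").split("\t")
            keys.add((f[0], f[1], f[3], f[4]))
    return keys

def has_converged(previous_lines, current_lines):
    """True when two bootstrap rounds yield the same set of known sites."""
    return site_keys(previous_lines) == site_keys(current_lines)
```

    Each round, the output of select_confident_snps would be written to a VCF and passed as the known-sites file for that strain; has_converged tells you when another round is unlikely to change anything.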

    Thanks, Geraldine. Will do!

