Hi. I am sequencing the genomes of a panel of bacterial strains (Mycobacterium smegmatis) and aligning them to a reference strain. Some of my strains are very similar to the reference (an estimated 100 variants across the genome) and some are more diverged (an estimated 10,000 variants across the genome). I am interested in using GATK to identify the SNPs and indels present in these strains.
My question is about generating a list of “known” SNPs for the BaseRecalibrator tool. I have generated an initial list of high-confidence SNPs from the low-variance strains (about 100 variants each) and will feed this as a VCF file to the base quality score recalibrator. This should work well to refine the base quality scores for the low-variance strains.
But what about the high-variance strains? My concern is that, if I use the same VCF file (derived from the low-variance strains) to recalibrate sequencing runs from high-variance strains, I might introduce error, because every real variant in a high-variance strain that is absent from the provided VCF file will be treated as a sequencing error.
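To put rough numbers on this concern, here is a minimal back-of-the-envelope sketch. It is not how GATK computes anything internally; it only estimates how much unmasked true variants could inflate the apparent mismatch rate. The genome length (about 7 Mb), the Q30 baseline error rate, uniform coverage, and the idea that every base at an unmasked variant site counts as a mismatch are all my own assumptions, and the function names are made up for illustration.

```python
import math


def phred(error_rate):
    """Convert an error probability into a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def apparent_quality(true_error_rate, unmasked_variants, genome_length):
    """Estimate the quality implied by the observed mismatch rate.

    Assumes uniform coverage and a haploid genome, so every base that
    covers an unmasked variant site is counted as a mismatch.
    """
    variant_fraction = unmasked_variants / genome_length
    # Mismatches come from variant sites plus true errors everywhere else.
    apparent_error = variant_fraction + (1.0 - variant_fraction) * true_error_rate
    return phred(apparent_error)


# Assumed values, not measurements: ~7 Mb genome, true error rate of Q30.
GENOME_LENGTH = 7_000_000
TRUE_ERROR_RATE = 0.001

for n_variants in (0, 100, 10_000):
    q = apparent_quality(TRUE_ERROR_RATE, n_variants, GENOME_LENGTH)
    print(f"{n_variants:>6} unmasked variants -> apparent quality ~Q{q:.1f}")
```

Under these assumptions, 100 unmasked variants barely move the estimate, while 10,000 unmasked variants contribute more mismatches than the sequencing errors themselves, which is why I am worried about the high-variance strains.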
Does this mean that I will have to generate a provisional “known” SNPs VCF file for every strain?
Thank you very much for the advice!