Base Recalibration

rock, Harvard School of Public Health, Member

Hi. I am sequencing the genomes of a panel of bacterial strains (Mycobacterium smegmatis) and aligning them to a reference strain. Some of my strains are very similar to the reference (an estimated 100 variants across the genome) and some are more divergent (an estimated 10,000 variants across the genome). I am interested in using GATK to identify the SNPs and indels present in these strains.

My question is about generating a list of “known” SNPs for the BaseRecalibrator tool. I have generated an initial list of high-confidence SNPs from the low-variance strains (100 variants) and will feed this as a VCF file to the base quality score recalibrator. This should work well to refine the base quality scores for the low-variance strains.

But what about the high-variance strains? My concern is that, if I use the same VCF file (derived from the low-variance strains) to do base recalibration on sequencing runs from high-variance strains, I might introduce error, because every real variant in the high-variance strain that is missing from the provided VCF file will be treated as a sequencing error.
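To put a rough number on this concern, here is a minimal back-of-the-envelope sketch. It is not how GATK computes recalibrated qualities; it only applies the Phred formula to a pooled mismatch count. The genome size (about 7 Mb), the 50x coverage, and the true error rate of 1 in 1,000 are all illustrative assumptions, as are the function names.

```python
import math

def empirical_phred(mismatches: int, total_bases: int) -> float:
    """Phred-scaled quality implied by an observed mismatch rate (no smoothing)."""
    return -10.0 * math.log10(mismatches / total_bases)

# Illustrative assumptions, not values from the thread or from GATK.
GENOME_SIZE = 7_000_000   # assumed genome length in bp
COVERAGE = 50             # assumed mean read depth
TRUE_ERROR_RATE = 0.001   # assumed real sequencing error rate (Q30)

total_bases = GENOME_SIZE * COVERAGE
sequencing_errors = round(total_bases * TRUE_ERROR_RATE)

def apparent_quality(unmasked_variant_sites: int) -> float:
    """Quality estimate when real variants are not masked by the known-sites file.

    Every read base covering an unmasked variant site is counted as a mismatch,
    which is the situation described in the question.
    """
    variant_mismatches = unmasked_variant_sites * COVERAGE
    return empirical_phred(sequencing_errors + variant_mismatches, total_bases)

q_true = apparent_quality(0)        # all real variants masked
q_low = apparent_quality(100)       # low-variance strain, nothing masked
q_high = apparent_quality(10_000)   # high-variance strain, nothing masked

print(f"masked: Q{q_true:.1f}, 100 unmasked: Q{q_low:.1f}, 10,000 unmasked: Q{q_high:.1f}")
```

Under these assumptions, 100 unmasked variants barely move the estimate, while 10,000 unmasked variants pull it down by several Phred units (roughly Q26 instead of Q30). The size of the effect depends heavily on the assumed coverage and error rate.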

Does this mean that I will have to generate a provisional “known” SNPs VCF file for every strain?

Thank you very much for the advice!


  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin

    Hi @rock,

    You are correct that using the same set of knowns on the high-variance strains may be problematic. Unfortunately, we do not have any experience dealing with this, as we only work with human genomes, so the recommendations we can give you are limited. I do think the safest option is to bootstrap a set of knowns for each of the high-variance strains, but feel free to experiment with different groupings.

  • rock, Harvard School of Public Health, Member

    Thanks, Geraldine, will do!
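For readers following up on the bootstrapping suggestion above: one step in building a provisional "known" set for a strain is to keep only the high-confidence SNPs from an initial round of variant calls. The sketch below shows that filtering step in plain Python on VCF text lines. The function name, the QUAL cutoff of 100, and the example records are hypothetical choices for illustration, not GATK defaults or recommendations from the thread.

```python
from typing import Iterable, List

BASES = {"A", "C", "G", "T"}

def filter_high_confidence_snps(vcf_lines: Iterable[str], min_qual: float = 100.0) -> List[str]:
    """Keep header lines plus SNP records whose QUAL is at least min_qual.

    min_qual is an arbitrary illustrative cutoff; tune it for your own data.
    """
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # preserve meta-information and column header lines
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip malformed records
        ref, alt, qual = fields[3], fields[4], fields[5]
        if qual == ".":
            continue  # missing QUAL: cannot judge confidence
        # A SNP here means a single-base REF and only single-base ALT alleles.
        is_snp = ref in BASES and all(a in BASES for a in alt.split(","))
        if is_snp and float(qual) >= min_qual:
            kept.append(line)
    return kept

# Hypothetical records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t101\t.\tA\tG\t250.0\tPASS\t.",   # confident SNP: kept
    "chr1\t202\t.\tC\tT\t35.0\tPASS\t.",    # low-quality SNP: dropped
    "chr1\t303\t.\tGA\tG\t400.0\tPASS\t.",  # indel: dropped
]
kept = filter_high_confidence_snps(example, min_qual=100.0)
print("\n".join(kept))
```

The filtered records can then be written out as the per-strain known-sites VCF for recalibration. A common pattern is to repeat the call, filter, and recalibrate cycle until the call set stops changing, though whether that converges well for a given strain is something to check empirically.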
