
Quality Score Recalibration for Non-Model Organisms

kmhernan, Chicago, IL | Member | Posts: 29

I have been working primarily with non-model organisms (mostly inbred mapping populations, but that's a topic for a different discussion). To recalibrate base qualities, I first run through indel realignment and SNP and indel calling, then filter around indels. I work with multi-sample VCFs (usually 100-300 samples), and my approach is as follows. I take the SNPs above the 90th percentile of ALTQ among all SNPs in my filtered SNP VCF file. Then, for each sample in the VCF, I write these top SNPs to a separate per-sample VCF file, keeping a site only if that sample's GQ > 90 and the site is a SNP in that sample. I then use these per-sample high-quality (HQ) VCF files for the BQSR tools.

I have a simple Python script for this located here

usage: [-h] -i INFILE -o OUTDIR [--ploidy PLOIDY] [--GQ GQ]
                          [--percentile PERCENTILE]

Split multi-sample VCFs into single sample VCFs of high quality SNPs.

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Multi-sample VCF file
  -o OUTDIR, --outdir OUTDIR
                        Directory to output HQ VCF files.
  --ploidy PLOIDY       1 for haploid; 2 for diploid
  --GQ GQ               Filters out variants with GQ < this limit.
  --percentile PERCENTILE
                        Reduces to variants with ALTQ > this percentile.
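
For readers without access to the linked script, the splitting step described above can be sketched as follows. This is a minimal illustration, not the script itself: it assumes the VCF QUAL column stands in for ALTQ, uses a nearest-rank percentile, and the function names (`percentile_cutoff`, `is_snp`, `split_hq_snps`) are hypothetical. It also skips the ploidy option and header handling.

```python
import math

def percentile_cutoff(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def is_snp(ref, alt):
    """True if REF is one base and every ALT allele is a single A/C/G/T."""
    return len(ref) == 1 and all(len(a) == 1 and a in "ACGT" for a in alt.split(","))

def split_hq_snps(vcf_lines, gq_min=90, pct=90):
    """Return {sample: [single-sample VCF data lines]} of high-quality SNPs.

    Assumption: QUAL (column 6) is used as the site-level score.
    """
    samples, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        if fields[5] != "." and is_snp(fields[3], fields[4]):
            records.append(fields)
    out = {s: [] for s in samples}
    if not records:
        return out
    cutoff = percentile_cutoff([float(f[5]) for f in records], pct)
    for fields in records:
        if float(fields[5]) <= cutoff:
            continue  # site is not above the requested percentile
        keys = fields[8].split(":")
        if "GT" not in keys or "GQ" not in keys:
            continue
        gt_i, gq_i = keys.index("GT"), keys.index("GQ")
        for sample, call in zip(samples, fields[9:]):
            parts = call.split(":")
            if len(parts) <= max(gt_i, gq_i) or parts[gq_i] == ".":
                continue
            alleles = parts[gt_i].replace("|", "/").split("/")
            has_alt = any(a not in ("0", ".") for a in alleles)
            # Keep the site for this sample only if it carries an ALT allele
            # and its genotype quality exceeds the limit.
            if has_alt and float(parts[gq_i]) > gq_min:
                out[sample].append("\t".join(fields[:9] + [call]))
    return out
```

Each list in the result holds the data lines for one sample; writing them out with a suitable header would give one HQ VCF per sample.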

Thoughts? Concerns? Perhaps I'm going about this in a completely wrong way?



  • glowgoose, University of Texas at Austin | Member | Posts: 21

    I used Kyle's script with great success. Thanks, Kyle!
    I have a follow-up to this question about working with non-model organisms. Problem number two is how to recalibrate variant quality scores (VQSR). My solution is to use replicates (we can afford them since we work with RAD, not whole-genome resequencing) and treat SNPs that are consistently reproducible among duplicated individuals as the "true" SNPs for VQSR. The extractor script is here:
    The script will print its usage info if run without arguments. To make a "truth" dataset for VQSR, you would want to run it with the options polyonly=1 (to extract only SNPs that are polymorphic among duplicated individuals) and falt=0.15 (fraction of alternative allele >=0.15). Having 6 pairs of duplicates seems to be sufficient, although next time I would rather duplicate 10 individuals from different populations.

    An additional useful script that I've cobbled together for working with replicates calculates how well the replicates match each other after all the filtering and recalibration, to estimate the overall sensitivity and accuracy. Among other metrics, it calculates the fraction of detected heterozygotes per replicate (detecting heterozygotes is always a problem when working with low-coverage data). The script is here:

    Please feel free to email me directly with questions (and bugs): Mikhail Matz,
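
The replicate-based selection described in this comment can be illustrated with a short sketch. It is one possible reading of the polyonly and falt options, not the linked extractor script: the function name `reproducible_sites`, the in-memory genotype layout, and the exact definitions of "polymorphic" and "fraction of alternative allele" are all assumptions.

```python
def reproducible_sites(genotypes, pairs, polyonly=True, falt=0.15):
    """Pick sites whose genotypes agree within every replicate pair.

    genotypes: {site: {sample: (allele1, allele2) or None for a missing call}}
    pairs:     [(replicate_a, replicate_b), ...]
    Assumed meanings: polyonly keeps only sites with more than one allele
    observed among the duplicated individuals; falt is the minimum fraction
    of non-reference alleles among them (each pair counted once).
    """
    if not pairs:
        raise ValueError("at least one replicate pair is required")
    kept = []
    for site, calls in genotypes.items():
        pair_gts = []
        for rep_a, rep_b in pairs:
            ga, gb = calls.get(rep_a), calls.get(rep_b)
            # A missing call or a mismatch in any pair disqualifies the site.
            if ga is None or gb is None or sorted(ga) != sorted(gb):
                break
            pair_gts.append(tuple(sorted(ga)))
        else:
            alleles = [a for gt in pair_gts for a in gt]
            alt_fraction = sum(1 for a in alleles if a != 0) / len(alleles)
            if polyonly and len(set(alleles)) < 2:
                continue
            if alt_fraction >= falt:
                kept.append(site)
    return kept
```

With two pairs where one individual is heterozygous in both replicates and the other is homozygous reference in both, the site has an alternative allele fraction of 0.25 and is kept; a site where one replicate disagrees with its partner is dropped.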

  • Geraldine_VdAuwera | Administrator, Dev | Posts: 11,130

    Hi @kmhernan,

    Your approach is reasonable, although I would comment that it is not really necessary to split your HQ variants into per-sample files. You can use a population-based callset to recalibrate base qualities in a sample; the known sites don't need to be specific to that sample. As a result, you don't need to use GQ as a filter.

    @glowgoose, thanks for contributing your scripts for working with replicates. We'd be interested in hearing from you and others on how this performs compared to other approaches (hard filtering, etc.).

    Geraldine Van der Auwera, PhD
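
To make the suggestion above concrete, here is a minimal sketch of passing one population-level VCF as the known sites for every sample. It only builds GATK 3 BaseRecalibrator command lines and does not run them. The helper name `bqsr_command` and all file names (reference.fasta, population_hq_snps.vcf, the BAM files, the default jar path) are hypothetical placeholders.

```python
def bqsr_command(bam, reference, known_sites, gatk_jar="GenomeAnalysisTK.jar"):
    """Build a GATK 3 BaseRecalibrator command for one BAM file."""
    table = bam.rsplit(".", 1)[0] + ".recal.table"
    return [
        "java", "-jar", gatk_jar,
        "-T", "BaseRecalibrator",
        "-R", reference,
        "-I", bam,
        "-knownSites", known_sites,  # same population callset for every sample
        "-o", table,
    ]

# Hypothetical inputs: two per-sample BAMs and one population-level HQ SNP VCF.
bams = ["sample_001.bam", "sample_002.bam"]
commands = [
    bqsr_command(bam, "reference.fasta", "population_hq_snps.vcf")
    for bam in bams
]
for cmd in commands:
    print(" ".join(cmd))
```

Because the known sites are shared, the per-sample GQ filter and the per-sample VCF files drop out of the workflow; only the site-level quality filter remains.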
