The current GATK version is 3.3-0

# Quality Score Recalibration for Non-Model Organisms

I have been working primarily with non-model organisms (mostly inbred mapping populations, but that's a topic for a different discussion). To recalibrate base qualities, I first run through indel realignment and SNP and indel calling, then filter around indels. I work with multi-sample VCFs (usually 100-300 samples) and take the following approach: I grab the SNPs above the 90th percentile of ALTQ among all SNPs in my filtered SNP VCF file, then, for each sample in the VCF file, I write these top SNPs to a separate per-sample VCF file if the sample's GQ > 90 and the site is a SNP for that sample. I then use these per-sample high-quality (HQ) VCF files for the BQSR tools.

I have a simple Python script for this located here

    usage: GetHighQualVcfs.py [-h] -i INFILE -o OUTDIR [--ploidy PLOIDY] [--GQ GQ]
                              [--percentile PERCENTILE]

    Split multi-sample VCFs into single sample VCFs of high quality SNPs.

    optional arguments:
      -h, --help            show this help message and exit
      -i INFILE, --infile INFILE
                            Multi-sample VCF file
      -o OUTDIR, --outdir OUTDIR
                            Directory to output HQ VCF files.
      --ploidy PLOIDY       1 for haploid; 2 for diploid
      --GQ GQ               Filters out variants with GQ < this limit.
      --percentile PERCENTILE
                            Reduces to variants with ALTQ > this percentile.
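
For readers who want to see the idea in code, here is a minimal sketch of the two-stage filter described above. It is not the GetHighQualVcfs.py script itself. The function names are hypothetical, the QUAL column is used as a stand-in for ALTQ (an assumption), the percentile is a simple nearest-rank one, and "SNP for that sample" is taken to mean that the sample's GT carries at least one non-reference allele.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def split_high_quality_snps(vcf_lines, gq_min=90, pct=90):
    """Return {sample: [single-sample VCF data lines]} of high-quality SNPs.

    Assumption: QUAL (column 6) stands in for the ALTQ score from the post.
    """
    samples, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        ref, alt, qual = fields[3], fields[4], fields[5]
        # Keep biallelic SNPs only: one-base REF and a single one-base ALT.
        if len(ref) != 1 or len(alt) != 1 or alt == "." or qual == ".":
            continue
        records.append((float(qual), fields))

    out = {s: [] for s in samples}
    if not records:
        return out

    # Stage 1: keep sites whose score is above the chosen percentile.
    cutoff = percentile([q for q, _ in records], pct)
    for qual, fields in records:
        if qual <= cutoff:
            continue
        keys = fields[8].split(":")
        if "GT" not in keys or "GQ" not in keys:
            continue
        gt_i, gq_i = keys.index("GT"), keys.index("GQ")
        # Stage 2: per sample, require a non-reference call with GQ >= gq_min.
        for sample, call in zip(samples, fields[9:]):
            parts = call.split(":")
            if len(parts) <= max(gt_i, gq_i) or parts[gq_i] == ".":
                continue
            alleles = parts[gt_i].replace("|", "/").split("/")
            carries_alt = any(a not in ("0", ".") for a in alleles)
            if carries_alt and float(parts[gq_i]) >= gq_min:
                out[sample].append("\t".join(fields[:9] + [call]))
    return out
```

Each returned list holds data lines only; to get a usable known-sites file you would still need to write the original header with a single sample column in front of them.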


• University of Texas at Austin

I used Kyle's script with great success. Thanks, Kyle! I have a follow-up to this question about working with non-model organisms. The second problem is how to recalibrate variant quality scores (VQSR). My solution is to use replicates (we can afford them since we work with RAD, not whole-genome resequencing) and to treat SNPs that are consistently reproducible among duplicated individuals as the "true" SNPs for VQSR.

The extractor script is here: https://dl.dropboxusercontent.com/u/37523721/replicatesMatch.pl The script prints its usage info if run without arguments. To make a "truth" dataset for VQSR, run it with the options polyonly=1 (to extract only SNPs that are polymorphic among duplicated individuals) and falt=0.15 (fraction of alternative allele >= 0.15). Having 6 pairs of duplicates seems to be sufficient, although next time I would rather duplicate 10 individuals from different populations.
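
To make the selection logic concrete, here is a rough Python sketch of the idea. It is not a port of replicatesMatch.pl. The function name and the input layout are hypothetical, and I am assuming that a site counts as reproducible only when every replicate pair has identical, non-missing genotypes, and that the alternative-allele fraction is computed over the replicated individuals only.

```python
def reproducible_snps(genotypes, pairs, polyonly=True, falt=0.15):
    """Pick sites whose genotypes agree within every replicate pair.

    genotypes: {site: {sample: (allele, allele)}} with 0 = reference,
               non-zero = alternative, None = missing allele.
    pairs:     list of (replicate_1, replicate_2) sample names.
    Assumption: all pairs must be called and must match at a site.
    """
    keep = []
    for site, calls in genotypes.items():
        alleles = []
        consistent = True
        for rep1, rep2 in pairs:
            g1, g2 = calls.get(rep1), calls.get(rep2)
            if g1 is None or g2 is None or None in g1 or None in g2:
                consistent = False
                break
            # Compare unphased genotypes, so (0, 1) equals (1, 0).
            if sorted(g1) != sorted(g2):
                consistent = False
                break
            alleles.extend(g1)
            alleles.extend(g2)
        if not consistent or not alleles:
            continue
        alt_fraction = sum(1 for a in alleles if a != 0) / len(alleles)
        # polyonly: drop sites that are fixed among the replicates.
        if polyonly and alt_fraction in (0.0, 1.0):
            continue
        if alt_fraction < falt:
            continue
        keep.append(site)
    return sorted(keep)
```

With sites keyed as (chromosome, position) tuples, the returned list is in coordinate order within each chromosome and can be used to subset the original VCF into a "truth" set.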

Another useful script that I've cobbled together for working with replicates calculates how well the replicates match each other after all the filtering and recalibration, to estimate the overall sensitivity and accuracy. Among other metrics, it calculates the fraction of detected heterozygotes per replicate (detecting heterozygotes is always a problem with low-coverage data). The script is here: https://dl.dropboxusercontent.com/u/37523721/repMatchStats.pl
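
As a rough illustration of this kind of summary (not the actual output of repMatchStats.pl), the Python sketch below computes two numbers per replicate pair: the fraction of jointly called sites with identical genotypes, and a heterozygote concordance, meaning the fraction of sites called heterozygous in both replicates among sites called heterozygous in at least one. Both definitions, the function name, and the input layout are my assumptions.

```python
def replicate_match_stats(genotypes, pairs):
    """Summarize genotype agreement for each replicate pair.

    genotypes: {site: {sample: (allele, allele)}}; samples that are
               not called at a site are simply absent from the inner dict.
    pairs:     list of (replicate_1, replicate_2) sample names.
    """
    stats = {}
    for rep1, rep2 in pairs:
        compared = matched = het_any = het_both = 0
        for calls in genotypes.values():
            g1, g2 = calls.get(rep1), calls.get(rep2)
            if g1 is None or g2 is None:
                continue  # only sites called in both replicates count
            compared += 1
            if sorted(g1) == sorted(g2):
                matched += 1
            het1 = len(set(g1)) > 1
            het2 = len(set(g2)) > 1
            if het1 or het2:
                het_any += 1
                if het1 and het2:
                    het_both += 1
        stats[(rep1, rep2)] = {
            "sites": compared,
            "match": matched / compared if compared else None,
            "het_concordance": het_both / het_any if het_any else None,
        }
    return stats
```

A low heterozygote concordance next to a high overall match rate would point to heterozygotes being miscalled as homozygotes, which is the typical failure mode at low coverage.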

Please feel free to email me directly with questions (and bugs): Mikhail Matz, matz@utexas.edu