The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

Quality Score Recalibration for Non-Model Organisms

I have been working primarily with non-model organisms (and mostly inbred-mapping populations, but that's a topic for a different discussion). To recalibrate base qualities, I have taken the approach of running through the Indel Realignment, SNP, and INDEL calling. Then, filtering around INDELs. I use multi-sample VCFs and have taken the following approach to recalibrate base quality: I grab the top 90th percentile SNPs from all SNPs in my filtered SNP VCF file (based on ALTQ), then I pull out these top SNPs for each SAMPLE in the VCF file (in my case I usually have between 100-300 samples) and write to SEPARATE VCF files for each SAMPLE if the GQ > 90 and it's a SNP for that sample. I then use these SAMPLE HQ VCF files for the BQSR tools.

I have a simple python script for this located here

usage: GetHighQualVcfs.py [-h] -i INFILE -o OUTDIR [--ploidy PLOIDY] [--GQ GQ]
                          [--percentile PERCENTILE]

Split multi-sample VCFs into single sample VCFs of high quality SNPs.

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Multi-sample VCF file
  -o OUTDIR, --outdir OUTDIR
                        Directory to output HQ VCF files.
  --ploidy PLOIDY       1 for haploid; 2 for diploid
  --GQ GQ               Filters out variants with GQ < this limit.
  --percentile PERCENTILE
                        Reduces to variants with ALTQ > this percentile.

Thoughts? Concerns? Perhaps I'm going about this in a completely wrong way?

Tagged:

Comments

  • glowgooseglowgoose University of Texas at AustinMember

    I used Kyle's script with great success, thanks, Kyle!
    I have a follow-up for this question, working with non-model organisms. The problem number two is how to recalibrate variant quality scores, VQSR. My solution is to use replicates (we can afford them since we work with RAD, not whole-genome resequencing) and use SNPs that are consistently reproducible among duplicated individuals as the "true" SNPs for VQSR. The extractor script is here: https://dl.dropboxusercontent.com/u/37523721/replicatesMatch.pl
    The script will print its usage info if run without arguments. For making a "truth" dataset for VQSR, you would want to run it with options polyonly=1 (to extract only SNPs that are polymorphic among duplicated individuals) and falt=0.15 (fraction of alternative allele >=0.15). It seems like having 6 pairs of duplicates is sufficient; although next time I would rather duplicate 10 individuals from different populations.

    Additional useful script that I've cobbled together for working with replicates calculates how well the replicates match each other after all the filtering and recalibration, to estimate the overall sensitivity and accuracy. Among other metrics, it will calculate the fraction of detected heterozygotes per replicate (detecting heterozygotes is always a problem when working with low-coverage data). The script is here:
    https://dl.dropboxusercontent.com/u/37523721/repMatchStats.pl

    Please feel free to email me directly with questions (and bugs): Mikhail Matz, matz@utexas.edu

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @kmhernan,

    Your approach is reasonable, although I would comment that it is not really necessary to split your HQ variants into per-sample file. You can use a population-based callset to recalibrate base qualities in a sample; the known sites don't need to be specific to that sample. As a result, you don't need to use GQ as a filter.

    @glowgoose, thanks for contributing your scripts for working with replicates. We'd be interested in hearing from you and others on how this performs compared to other approaches (hard filtering etc).

Sign In or Register to comment.