The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Download the latest Picard release at
GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

Creating Variant Validation Sets - RETIRED

delangeldelangel Broad InstituteMember
edited May 2015 in Archive

Please note that this article has not been updated in a very long time and may no longer be applicable. Use at your own risk.



ValidationSiteSelectorWalker is intended for use in experiments where we sample data randomly from a set of variants, for example in order to choose sites for a follow-up validation study. Sites are selected randomly but within certain restrictions. There are two main sources of restrictions: Sample restrictions and Frequency restrictions. Sample restrictions alter the polymorphic/monomorphic status of sites by restricting the sample set to a given number of samples. Frequency restrictions bias the site sampling method to sample either uniformly, or in accordance with the allele frequency spectrum of the input VCF.

GATK Documentation

For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector.

Sample and Frequency Restrictions


The -sampleMode argument controls the mode of sample-based site consideration. The options are:

  • None: All sites are included for consideration, including reference sites
  • Poly_based_on_gt: Site is included if it has a variant genotype in at least one of the selected samples
  • Poly_based_on_gl: Site is included if it is likely to be variant based on the genotype likelihoods of the selected samples


Note that Poly_based_on_gl uses the exact allele frequency calculation model to estimate P[site is nonref]. The site is considered for validation if P[site is nonref] > [this argument]. So if you want to validate sites that are >95% confidently nonref (based on the likelihoods), you would set -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95


The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are:

  • Uniform: Choose variants uniformly, without regard to their allele frequency.
  • Keep AF Spectrum: Choose variants so that the resulting allele frequency matches as closely as possible to that of the input VCF.
Post edited by Geraldine_VdAuwera on
This discussion has been closed.