Creating Variant Validation Sets - RETIRED

delangel, Broad Institute, Member
edited May 2015 in Archive

Please note that this article has not been updated in a very long time and may no longer be applicable. Use at your own risk.



ValidationSiteSelectorWalker is intended for experiments in which variants are sampled randomly from a larger set, for example to choose sites for a follow-up validation study. Sites are selected randomly, but subject to two kinds of restrictions: sample restrictions and frequency restrictions. Sample restrictions alter the polymorphic/monomorphic status of sites by restricting the sample set to a given number of samples. Frequency restrictions determine whether sites are sampled uniformly or in accordance with the allele frequency spectrum of the input VCF.

GATK Documentation

For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector.

Sample and Frequency Restrictions


The -sampleMode argument controls how the selected samples are used to decide which sites are considered. The options are:

  • None: All sites are included for consideration, including reference sites
  • Poly_based_on_gt: Site is included if it has a variant genotype in at least one of the selected samples
  • Poly_based_on_gl: Site is included if it is likely to be variant based on the genotype likelihoods of the selected samples


Note that Poly_based_on_gl uses the exact allele frequency calculation model to estimate P[site is nonref]. The site is considered for validation if P[site is nonref] is greater than the value given by -samplePNonref. So if you want to validate sites that are more than 95% likely to be nonref (based on the likelihoods), you would set: -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95
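
To make the three sample modes concrete, here is a rough Python sketch of the inclusion decision. It is not the tool's implementation. In particular, the GL branch does not use the exact allele frequency model: it assumes independent samples and a flat genotype prior, which is a deliberate simplification. The function names and the input layout (allele-index tuples, linear-scale likelihoods ordered hom-ref, het, hom-var) are hypothetical and chosen only for illustration.

```python
from math import prod


def is_poly_by_gt(genotypes):
    """True if any selected sample carries a non-reference allele.

    genotypes: one tuple of allele indices per selected sample, e.g. (0, 1);
    a no-call allele is represented as None (hypothetical layout).
    """
    return any(a is not None and a > 0 for gt in genotypes for a in gt)


def p_nonref_naive(likelihoods):
    """Crude stand-in for P[site is nonref].

    likelihoods: per selected sample, linear-scale likelihoods
    (hom-ref, het, hom-var). ASSUMPTION: flat genotype prior and independent
    samples. This is NOT the exact allele frequency model.
    """
    p_all_homref = prod(gl[0] / sum(gl) for gl in likelihoods)
    return 1.0 - p_all_homref


def consider_site(mode, genotypes, likelihoods, sample_p_nonref=0.95):
    """Decide whether a site is considered, following the three modes above."""
    if mode == "NONE":
        return True  # every site is considered, including reference sites
    if mode == "POLY_BASED_ON_GT":
        return is_poly_by_gt(genotypes)
    if mode == "POLY_BASED_ON_GL":
        # Mirrors the -samplePNonref threshold described above.
        return p_nonref_naive(likelihoods) > sample_p_nonref
    raise ValueError(f"unknown sample mode: {mode}")
```

With this sketch, a site whose only selected sample has likelihoods (0.01, 0.98, 0.01) is considered under POLY_BASED_ON_GL with a 0.95 threshold, while a site where that sample is most likely hom-ref is not.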


The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are:

  • Uniform: Choose variants uniformly, without regard to their allele frequency.
  • Keep AF Spectrum: Choose variants so that the allele frequency spectrum of the selected sites matches that of the input VCF as closely as possible.
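
The difference between the two frequency modes can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the fixed-width allele frequency bins, the proportional quota per bin, and the function names are all hypothetical, and the tool's real binning and sampling scheme may differ.

```python
import random
from collections import defaultdict


def select_uniform(sites, n, rng):
    """Pick up to n sites uniformly at random, ignoring allele frequency."""
    return rng.sample(sites, min(n, len(sites)))


def select_keep_af_spectrum(sites, n, rng, num_bins=10):
    """Pick sites so that each allele frequency bin keeps its share of the input.

    sites: list of (site_id, allele_frequency) pairs with AF in [0, 1].
    ASSUMPTION: equal-width AF bins and a rounded proportional quota per bin,
    so the total returned can differ slightly from n.
    """
    if not sites:
        return []
    bins = defaultdict(list)
    for site in sites:
        idx = min(int(site[1] * num_bins), num_bins - 1)
        bins[idx].append(site)
    selected = []
    for idx in sorted(bins):
        members = bins[idx]
        quota = round(n * len(members) / len(sites))
        selected.extend(rng.sample(members, min(quota, len(members))))
    return selected


# Toy input: 80 rare variants (AF 0.05) and 20 common ones (AF 0.55).
rng = random.Random(0)
toy_sites = [(f"rare{i}", 0.05) for i in range(80)] + [
    (f"common{i}", 0.55) for i in range(20)
]
uniform_pick = select_uniform(toy_sites, 10, rng)
spectrum_pick = select_keep_af_spectrum(toy_sites, 10, rng)
```

On this toy input the spectrum-preserving pick always contains 8 rare and 2 common sites, matching the 80/20 split of the input, whereas the uniform pick only matches that split on average.
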
Post edited by Geraldine_VdAuwera on