Creating Variant Validation Sets - RETIRED
Please note that this article has not been updated in a very long time and may no longer be applicable. Use at your own risk.
ValidationSiteSelectorWalker is intended for experiments in which sites are sampled randomly from a set of variants, for example to choose sites for a follow-up validation study. Sites are selected randomly, but within certain restrictions. There are two main sources of restrictions: sample restrictions and frequency restrictions. Sample restrictions limit the analysis to a chosen subset of samples, which can change whether a site counts as polymorphic or monomorphic. Frequency restrictions determine whether sites are sampled uniformly or in accordance with the allele frequency spectrum of the input VCF.
For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector.
Sample and Frequency Restrictions
The -sampleMode argument controls the mode of sample-based site consideration. The options are:
- None: All sites are included for consideration, including reference sites
- Poly_based_on_gt: Site is included if it has a variant genotype in at least one of the selected samples
- Poly_based_on_gl: Site is included if it is likely to be variant based on the genotype likelihoods of the selected samples
Note that Poly_based_on_gl uses the exact allele frequency calculation model to estimate P[site is nonref]. A site is considered for validation if P[site is nonref] is greater than the value passed to -samplePNonref. So if you want to validate sites that are more than 95% likely to be nonref (based on the likelihoods), you would set -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95.
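As a rough illustration of the two polymorphism tests, the Python sketch below checks genotypes for a non-reference allele and compares an estimated P[site is nonref] against a threshold. It is not the GATK implementation: instead of the exact allele frequency calculation model, it assumes independent samples and a flat genotype prior, and the function names and input layout (tuples of allele indices, triples of log10 genotype likelihoods) are hypothetical.

```python
def is_poly_based_on_gt(genotypes):
    """Return True if any called genotype carries a non-reference allele.

    genotypes: one tuple of allele indices per selected sample
    (0 = reference, None = no-call). Hypothetical input layout.
    """
    return any(
        allele is not None and allele > 0
        for gt in genotypes
        for allele in gt
    )


def approx_p_nonref(sample_log10_gls):
    """Approximate P[site is nonref] from genotype likelihoods.

    sample_log10_gls: one (hom_ref, het, hom_var) triple of log10
    likelihoods per selected sample.
    ASSUMPTION: samples are independent and the genotype prior is flat.
    This is a simplification, not the exact allele frequency model.
    """
    p_all_hom_ref = 1.0
    for gls in sample_log10_gls:
        best = max(gls)
        # Rescale by the best likelihood before exponentiating to limit underflow.
        scaled = [10 ** (gl - best) for gl in gls]
        p_all_hom_ref *= scaled[0] / sum(scaled)
    return 1.0 - p_all_hom_ref


def is_poly_based_on_gl(sample_log10_gls, sample_p_nonref):
    """Mirror the threshold test: keep the site if P[nonref] > threshold."""
    return approx_p_nonref(sample_log10_gls) > sample_p_nonref
```

With a threshold of 0.95, a site whose only sample strongly favors a heterozygous genotype would pass this test, while a site whose samples all strongly favor homozygous reference would not.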
The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are:
- Uniform: Choose variants uniformly, without regard to their allele frequency.
- Keep AF Spectrum: Choose variants so that the allele frequency spectrum of the selected sites matches that of the input VCF as closely as possible.
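The difference between the two frequency modes can be sketched in Python as follows. This is only one plausible reading, not the tool's actual algorithm: it assumes that spectrum matching is done by grouping variants into equal-width allele frequency bins and drawing from each bin in proportion to its size. The function names, the "af" key, and the num_bins parameter are all hypothetical.

```python
import random
from collections import defaultdict


def select_uniform(variants, n, rng):
    """Draw up to n variants at random, ignoring allele frequency."""
    return rng.sample(variants, min(n, len(variants)))


def select_keep_af_spectrum(variants, n, rng, num_bins=10):
    """Draw about n variants with per-bin counts proportional to the input.

    variants: list of dicts with an "af" key in [0, 1] (hypothetical layout).
    ASSUMPTION: equal-width AF bins with proportional allocation.
    Because quotas are rounded, the result may hold slightly more or
    fewer than n variants.
    """
    bins = defaultdict(list)
    for variant in variants:
        # Clamp so that af == 1.0 falls into the last bin.
        index = min(int(variant["af"] * num_bins), num_bins - 1)
        bins[index].append(variant)

    total = len(variants)
    selected = []
    for index in sorted(bins):
        members = bins[index]
        quota = round(n * len(members) / total)
        selected.extend(rng.sample(members, min(quota, len(members))))
    return selected


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy input: mostly rare variants plus a few common ones.
    toy = [{"id": i, "af": 0.05} for i in range(80)]
    toy += [{"id": 80 + i, "af": 0.55} for i in range(20)]
    print(len(select_uniform(toy, 10, rng)))
    print(len(select_keep_af_spectrum(toy, 10, rng)))
```

Passing an explicit random.Random instance keeps the selection reproducible for a given seed, which helps when a validation site list has to be regenerated later.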