# Creating Variant Validation Sets - RETIRED

edited May 2015 in Archive

## Introduction

ValidationSiteSelectorWalker is intended for experiments in which we sample sites randomly from a set of variants, for example to choose sites for a follow-up validation study. Sites are selected randomly, but within certain restrictions. There are two main sources of restrictions: sample restrictions and frequency restrictions. Sample restrictions change whether a site counts as polymorphic or monomorphic by limiting the evaluation to a given subset of samples. Frequency restrictions determine whether sites are sampled uniformly or in accordance with the allele frequency spectrum of the input VCF.

## GATK Documentation

For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector.

## Sample and Frequency Restrictions

### -sampleMode

The -sampleMode argument controls the mode of sample-based site consideration. The options are:

• NONE: All sites are included for consideration, including reference sites.
• POLY_BASED_ON_GT: A site is included if it has a variant genotype in at least one of the selected samples.
• POLY_BASED_ON_GL: A site is included if it is likely to be variant based on the genotype likelihoods of the selected samples.
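
To make the effect of restricting the sample set concrete, here is a minimal Python sketch of the first two modes. It is an illustration only, not the tool's implementation: the site representation (a dict with a `genotypes` map from sample name to a tuple of allele indices) and the function names are assumptions made for this example.

```python
def is_variant_genotype(gt):
    """True if any called allele is non-reference (allele index > 0).

    `gt` is a tuple of allele indices such as (0, 1); None marks a no-call.
    """
    return any(allele is not None and allele > 0 for allele in gt)


def passes_sample_mode(site, samples, mode):
    """Decide whether a site is considered, given the selected samples.

    `site["genotypes"]` is a hypothetical dict: sample name -> genotype tuple.
    Only the genotype-free and genotype-based modes are sketched here.
    """
    mode = mode.upper()
    if mode == "NONE":
        # Every site is a candidate, including reference-only sites.
        return True
    if mode == "POLY_BASED_ON_GT":
        # Polymorphic only if a *selected* sample carries a variant genotype.
        return any(
            is_variant_genotype(site["genotypes"][s])
            for s in samples
            if s in site["genotypes"]
        )
    raise ValueError(f"mode not covered by this sketch: {mode}")


site = {"genotypes": {"A": (0, 0), "B": (0, 1)}}
print(passes_sample_mode(site, ["A"], "POLY_BASED_ON_GT"))       # sample A alone is hom-ref
print(passes_sample_mode(site, ["A", "B"], "POLY_BASED_ON_GT"))  # sample B is heterozygous
```

The same site is monomorphic when only sample A is selected and polymorphic once sample B is added, which is how a sample restriction changes the status of a site.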

### -samplePNonref

The -samplePNonref argument sets the probability threshold used when -sampleMode is POLY_BASED_ON_GL. In that mode, the exact allele frequency calculation model is used to estimate P[site is nonref], and the site is considered for validation if P[site is nonref] > [this argument]. For example, to validate sites that are more than 95% confidently nonref (based on the genotype likelihoods), you would set -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95
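
The thresholding step can be sketched in Python as follows. Note that this is deliberately NOT the exact allele frequency model: as a stand-in, it assumes independent samples and a flat genotype prior, and computes P[site is nonref] as one minus the probability that every selected sample is homozygous reference. The function names and the input layout (one list of phred-scaled likelihoods, PL, per sample, hom-ref first) are assumptions for illustration.

```python
import math


def p_hom_ref(pl):
    """Normalised hom-ref probability from one sample's PL values.

    PL is phred-scaled, so likelihood = 10 ** (-PL / 10). A flat prior over
    genotypes is assumed (simplification).
    """
    likelihoods = [10 ** (-value / 10) for value in pl]
    return likelihoods[0] / sum(likelihoods)


def p_site_nonref(pls):
    """Simplified P[site is nonref] = 1 - P(all samples hom-ref).

    Stand-in for the exact AF calculation; treats samples as independent.
    """
    return 1.0 - math.prod(p_hom_ref(pl) for pl in pls)


def passes_gl_threshold(pls, sample_p_nonref):
    """Mirror of the rule above: keep the site if P[nonref] > threshold."""
    return p_site_nonref(pls) > sample_p_nonref


# One confident hom-ref sample and one confident het sample.
print(passes_gl_threshold([[0, 30, 60], [60, 0, 30]], 0.95))
```

Raising the threshold makes the selection stricter: sites whose evidence for a non-reference allele is weak in the selected samples drop out first.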

### -frequencySelectionMode

The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are:

• UNIFORM: Choose variants uniformly, without regard to their allele frequency.
• KEEP_AF_SPECTRUM: Choose variants so that the allele frequency spectrum of the selected sites matches that of the input VCF as closely as possible.
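
One way to picture the difference between the two modes is the Python sketch below. It is a hedged illustration rather than the tool's code: it assumes each site is a dict with an `af` value in [0, 1], uses a hypothetical fixed number of equal-width frequency bins, and fills each bin's quota by simple random sampling.

```python
import random
from collections import defaultdict


def select_uniform(sites, n, rng):
    """Pick up to n sites at random, ignoring allele frequency."""
    return rng.sample(sites, min(n, len(sites)))


def select_keep_af_spectrum(sites, n, rng, n_bins=10):
    """Pick sites bin by bin so each AF bin keeps its share of the input.

    `n_bins` is an illustrative choice. Because quotas are rounded, the
    total returned can differ slightly from n.
    """
    bins = defaultdict(list)
    for site in sites:
        index = min(int(site["af"] * n_bins), n_bins - 1)  # AF 1.0 -> last bin
        bins[index].append(site)

    chosen = []
    for index in sorted(bins):
        members = bins[index]
        quota = round(n * len(members) / len(sites))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen


rng = random.Random(0)
sites = [{"af": 0.05}] * 80 + [{"af": 0.55}] * 20
picked = select_keep_af_spectrum(sites, 10, rng)
rare = sum(1 for s in picked if s["af"] < 0.1)
print(len(picked), rare)  # 10 sites, 8 of them from the rare bin
```

With stratified quotas the 80/20 split between rare and common sites is reproduced in every draw, whereas a single random draw across all sites only matches it on average.
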
Post edited by Geraldine_VdAuwera on
