The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.10.4 has MAJOR CHANGES that impact throughput of pipelines. Default compression is now 1 instead of 5, and Picard now handles compressed data with the Intel Deflator/Inflator instead of JDK.
GATK version 4.beta.2 (i.e. the second beta release) is out. See the GATK4 BETA page for download and details.

Genotype and Validate

delangeldelangel Broad InstituteMember
edited May 2015 in Archive

Please note that this article has not been updated in a very long time and may no longer be applicable. Use at your own risk.


Genotype and Validate is a tool to asses the quality of a technology dataset for calling SNPs and Indels given a secondary (validation) datasource.

The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you want to know how well a particular technology performs calling these snps. With a dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's dataset.

Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.

Command-line arguments

Usage of GenotypeAndValidate and its command line arguments are described here.

The VCF Annotations

The annotations can be either true positive (T) or false positive (F). 'T' means it is known to be a true SNP/Indel, while a 'F' means it is known not to be a SNP/Indel but the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.

The Outputs

GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:

ALT REF Predictive Value
called alt True Positive (TP) False Positive (FP) Positive PV
called ref False Negative (FN) True Negative (TN) Negative PV

The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnose.

The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.

The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).

Additional Details

  • You should always use -BTI alleles, so that the GATK only looks at the sites on the VCF file, speeds up the process a lot. (this will soon be added as a default gatk engine mode)

  • The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).


Genotypes BAM file from new technology using the VCF as a truth dataset:

java \
    -jar /GenomeAnalysisTK.jar \
    -T  GenotypeAndValidate \
    -R human_g1k_v37.fasta \
    -I myNewTechReads.bam \
    -alleles handAnnotatedVCF.vcf \
    -BTI alleles \
    -o gav.vcf

An annotated VCF example (info field clipped for clarity)

1   20568807    .   C   T   0    HapMapHet        AC=1;AF=0.50;AN=2;DP=0;GV=T  GT  0/1
1   22359922    .   T   C   282  WG-CG-HiSeq      AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ  1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99    ./.
13  102391461   .   G   A   341  Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ  ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99   ./.
1   175516757   .   C   G   655  SnpCluster,WG    AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ  ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99   ./.

Using a BAM file as the truth dataset:

java \
    -jar /GenomeAnalysisTK.jar \
    -T  GenotypeAndValidate \
    -R human_g1k_v37.fasta \
    -I myTruthDataset.bam \
    -alleles callsToValidate.vcf \
    -BTI alleles \
    -bt \
    -o gav.vcf

Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:

PacBio PbGenotypeAndValidate results

Post edited by Geraldine_VdAuwera on
This discussion has been closed.