What types of variants can GATK tools detect / handle?
The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation. Let's break it down -- and while we're at it, call out the specific tools and workflows. See the Best Practices docs for details.
Germline short variants: SNPs and Indels
- Joint calling on multiple samples with HaplotypeCaller and GenotypeGVCFs + filtering with VQSR
- Single-sample calling with HaplotypeCaller + filtering with GATK CNN
HaplotypeCaller and GenotypeGVCFs are sophisticated germline short variant calling tools that can model SNPs and indels simultaneously, so they are capable of emitting mixed records by default, as well as symbolic representations for e.g. spanning deletions. They also emit short-range physical phasing information. However, they do not emit MNPs. If you would like to combine contiguous SNPs into MNPs, you will need to use the legacy ReadBackedPhasing tool in GATK3 with the MNP merging function activated. See the GATK3 tool documentation for details.
Our older (and now definitively retired) variant caller, UnifiedGenotyper, was even more limited. It called SNPs and indels separately (even if you ran in calling mode BOTH, the program performed separate calling operations internally) so it was not able to recognize when SNPs and indels should be emitted together as a joint record when they occur at the same site, nor when they were incompatible and only one could be correct.
Somatic short variants: SNVs and Indels
- Tumor-Normal pair with Mutect2
- Tumor-only with Mutect2
Mutect2 is a variant caller that is based on the original award-winning SNV caller Mutect, hybridized with HaplotypeCaller to enable sensitive indel calling. Mutect2 in GATK4 is very different from the early version that was included in GATK3; the GATK3 version should no longer be used.
Germline Copy Number Variants (CNVs)
- Rare and common germline CNVs in multiple samples
Somatic Copy Number Alterations (CNAs)
- Tumor-Normal pair with ModelSegments
- Tumor-only with ModelSegments
Germline Structural Variants (SVs)
Development is underway; stay tuned for updates.
GATK and Picard variant manipulation tools are currently able to recognize the following types of alleles:
- SNP (single nucleotide polymorphism)
- INDEL (insertion/deletion)
- MIXED (combination of SNPs and indels at a single position)
- MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
- SYMBOLIC (such as the <NON_REF> allele used in GVCFs produced by HaplotypeCaller, the * allele used to signify the presence of a spanning deletion, or undefined events like a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)
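To make these categories concrete, here is a minimal Python sketch that assigns one of them to a record from its REF and ALT alleles. It is a simplified illustration, not the actual logic used by GATK or Picard; the function names `allele_type` and `record_type` are hypothetical, and the rule that any combination of different allele types counts as MIXED is an assumption made for the example.

```python
def allele_type(ref, alt):
    """Classify a single ALT allele relative to REF (simplified assumption)."""
    # Symbolic alleles: angle-bracketed names such as <NON_REF>, or the
    # spanning-deletion allele '*'.
    if alt == "*" or (alt.startswith("<") and alt.endswith(">")):
        return "SYMBOLIC"
    # Same length as REF: a substitution of one base (SNP) or several (MNP).
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    # Different length: an insertion or a deletion.
    return "INDEL"


def record_type(ref, alts):
    """Classify a whole record; assumes at least one ALT allele is given."""
    kinds = {allele_type(ref, alt) for alt in alts}
    # One consistent type across all ALT alleles: the record has that type.
    if len(kinds) == 1:
        return next(iter(kinds))
    # Otherwise the record combines allele types (assumption: always MIXED).
    return "MIXED"


print(record_type("A", ["G"]))          # SNP
print(record_type("AT", ["A"]))         # INDEL
print(record_type("A", ["G", "AT"]))    # MIXED
print(record_type("AT", ["GC"]))        # MNP
print(record_type("A", ["<NON_REF>"]))  # SYMBOLIC
```

The point to notice is that a site carrying both a SNP allele and an insertion is labeled MIXED, not INDEL, which matters for any type-based selection.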
Note that SelectVariants, the GATK tool most used for VCF subsetting operations, discriminates strictly between these categories. This means that if you use for example -selectType INDEL to pull out indels, it will only select pure INDEL records, excluding any MIXED records that might include a SNP allele in addition to the insertion or deletion alleles of interest. To include those you would have to also specify -selectType MIXED in the same command.
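The effect of this strict matching can be sketched in a few lines of Python. This is only an illustration of the selection behavior described above, not SelectVariants itself; the `records` list and the `select_by_type` helper are hypothetical, and each record is assumed to already carry its type label.

```python
# Hypothetical records: (contig, position, record type).
records = [
    ("chr1", 100, "SNP"),
    ("chr1", 200, "INDEL"),
    ("chr1", 300, "MIXED"),  # e.g. a SNP allele plus an insertion at one site
]


def select_by_type(records, wanted_types):
    """Keep only records whose type exactly matches one of the requested types."""
    wanted = set(wanted_types)
    return [rec for rec in records if rec[2] in wanted]


# Asking for INDEL alone drops the MIXED record, even though it has an indel allele.
indel_only = select_by_type(records, ["INDEL"])

# Requesting both types keeps the MIXED record as well.
indel_and_mixed = select_by_type(records, ["INDEL", "MIXED"])

print([rec[1] for rec in indel_only])       # [200]
print([rec[1] for rec in indel_and_mixed])  # [200, 300]
```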