What types of variants can GATK tools detect / handle?
The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation. Let's break it down -- and while we're at it, call out the specific tools and workflows. See the Best Practices docs for details.
Short variants: SNPs/SNVs and Indels
The HaplotypeCaller and GenotypeGVCFs are sophisticated germline short variant calling tools that can model SNPs and indels simultaneously. As a result, they can emit mixed records by default, as well as symbolic representations for events such as spanning deletions. They also emit short-range physical phasing information. However, they do not emit MNPs. If you would like to combine contiguous SNPs into MNPs, you will need to use the legacy ReadBackedPhasing tool in GATK3 with the MNP merging function activated. See the GATK3 tool documentation for details.
Our older (and now definitively retired) variant caller, UnifiedGenotyper, was even more limited. It called SNPs and indels separately (even if you ran in calling mode BOTH, the program performed separate calling operations internally) so it was not able to recognize when SNPs and indels should be emitted together as a joint record when they occur at the same site, nor when they were incompatible and only one could be correct.
Mutect2 is a new somatic variant caller that is based on the original award-winning SNV caller Mutect, hybridized with HaplotypeCaller to enable sensitive indel calling.
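To make the idea of combining contiguous SNPs into MNPs more concrete, here is a minimal Python sketch. It is not how ReadBackedPhasing works internally: it assumes the SNPs have already been established to lie on the same haplotype (which is the hard part that phasing solves), and the function name and the `(pos, ref, alt)` tuple layout are hypothetical choices made for illustration only.

```python
def merge_adjacent_snps(snps):
    """Merge runs of directly adjacent SNPs into single MNP records.

    `snps` is a list of (pos, ref, alt) tuples, sorted by position, all on
    one chromosome. ASSUMPTION: every SNP in the list is already known to
    sit on the same haplotype; this sketch does no phasing of its own.
    """
    merged = []
    for pos, ref, alt in snps:
        if merged:
            last_pos, last_ref, last_alt = merged[-1]
            # The previous record ends right before this position,
            # so the two events are contiguous and can be joined.
            if last_pos + len(last_ref) == pos:
                merged[-1] = (last_pos, last_ref + ref, last_alt + alt)
                continue
        merged.append((pos, ref, alt))
    return merged


example = [(100, "A", "G"), (101, "C", "T"), (105, "G", "A")]
result = merge_adjacent_snps(example)
```

In this example the SNPs at positions 100 and 101 are joined into one dinucleotide substitution (AC to GT), while the SNP at position 105 is left alone because it is not contiguous with the others.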
Copy Number Variants (CNVs)
We have some tools for germline CNV (gCNV) discovery currently in beta status; we expect a release in the first quarter of 2018.
We have a new production-ready pipeline for somatic CNV discovery that performs well on both exomes and WGS.
Structural Variants (SVs)
Development of SV discovery tools is underway. In the meantime, there is also a third-party software package called GenomeSTRiP, built on top of GATK, that provides SV (structural variation) analysis capabilities.
GATK and Picard variant manipulation tools are currently able to recognize the following types of alleles:
- SNP (single nucleotide polymorphism)
- INDEL (insertion/deletion)
- MIXED (combination of SNPs and indels at a single position)
- MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
- SYMBOLIC (such as the <NON_REF> allele used in GVCFs produced by HaplotypeCaller, the * allele used to signify the presence of a spanning deletion, or undefined events like a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)
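As a rough illustration of how these categories relate to the REF and ALT alleles of a record, here is a minimal Python sketch. It is a simplification written for this article, not the logic used by GATK or Picard; the function names are hypothetical, and in particular the way symbolic alleles combine with other allele types in a multi-allelic record is an assumption.

```python
def classify_allele(ref, alt):
    """Classify a single ALT allele relative to the REF allele."""
    # Symbolic alleles: angle-bracketed names or the spanning-deletion star.
    if alt == "*" or (alt.startswith("<") and alt.endswith(">")):
        return "SYMBOLIC"
    # Same length: a substitution of one base (SNP) or several (MNP).
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    # Different lengths: an insertion or a deletion.
    return "INDEL"


def classify_record(ref, alts):
    """Classify a whole record from its REF allele and list of ALT alleles."""
    kinds = {classify_allele(ref, alt) for alt in alts}
    if not kinds:
        return "NO_VARIATION"
    if len(kinds) == 1:
        return kinds.pop()
    # ASSUMPTION: any combination of different allele types counts as MIXED.
    return "MIXED"
```

For example, `classify_record("A", ["C", "AT"])` describes a site with both a SNP allele and an insertion allele, which this sketch labels MIXED.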
Note that SelectVariants, the GATK tool most used for VCF subsetting operations, discriminates strictly between these categories. This means that if you use, for example, -selectType INDEL to pull out indels, it will only select pure INDEL records, excluding any MIXED records that might include a SNP allele in addition to the insertion or deletion alleles of interest. To include those, you would also have to specify -selectType MIXED in the same command.
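The strict behavior described above can be pictured with a small Python sketch. This is only a toy model of type-based selection, not SelectVariants itself; the record layout (dictionaries with hypothetical `id` and `type` keys) and the function name are assumptions made for illustration.

```python
def select_by_type(records, wanted_types):
    """Keep only records whose type is exactly one of the wanted types."""
    return [rec for rec in records if rec["type"] in wanted_types]


# Hypothetical, pre-classified records standing in for VCF lines.
records = [
    {"id": "site1", "type": "SNP"},
    {"id": "site2", "type": "INDEL"},
    {"id": "site3", "type": "MIXED"},  # e.g. a SNP allele plus an insertion allele
]

# Asking for INDEL alone skips the MIXED site, even though it carries an indel allele.
indels_only = select_by_type(records, {"INDEL"})

# Asking for both types recovers it.
indels_and_mixed = select_by_type(records, {"INDEL", "MIXED"})
```

Here `indels_only` contains just `site2`, while `indels_and_mixed` contains `site2` and `site3`.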