# Can I use GATK on non-diploid organisms?

Posts: 71 GATK Developer mod
edited September 2013 in FAQs

In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

Since version 2.0, the UnifiedGenotyper has been able to deal with ploidies other than two. Three use cases are currently supported:

1. Native variant calling in haploid or polyploid organisms.
2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
3. Pooled validation/genotyping at known sites.

To enable this feature, set the -ploidy argument to the desired number of chromosomes per organism. For pooled sequencing experiments, set this argument to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
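As a quick illustration of the pooled case, here is a minimal Python sketch of that multiplication. The function name and the example pool (12 diploid individuals) are made up for illustration and are not part of GATK.

```python
def pooled_ploidy(ploidy_per_individual: int, individuals_in_pool: int) -> int:
    """Return the value to pass to -ploidy for one barcoded sample (pool).

    Hypothetical helper: it only applies the formula from the post,
    (ploidy per individual) * (individuals in pool).
    """
    if ploidy_per_individual < 1 or individuals_in_pool < 1:
        raise ValueError("both counts must be positive integers")
    return ploidy_per_individual * individuals_in_pool


# Illustrative pool: 12 diploid individuals sharing a single barcode.
print(pooled_ploidy(2, 12))  # 24
```

For native calling in a single haploid or polyploid organism, the pool size is simply 1, so the value is the organism's own ploidy.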

Note that all other UnifiedGenotyper arguments work in the same way.

A minimal command line would look like this, for example (file names are placeholders):

java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-T UnifiedGenotyper \
-I input.bam \
-o output.vcf \
-ploidy 4


The -glm argument works the same way as in the diploid case: set it to [INDEL|SNP|BOTH] to specify which variant types to discover and/or genotype.
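If you script your runs, the arguments discussed so far can be assembled programmatically. The following is only a sketch: the helper name is hypothetical, it emits just the -R, -T, -ploidy and -glm arguments mentioned in this post, and input/output arguments would still need to be appended for a real run.

```python
from typing import List

# Values for -glm as listed in the post.
VALID_GLM = {"SNP", "INDEL", "BOTH"}


def unified_genotyper_args(reference: str, ploidy: int, glm: str) -> List[str]:
    """Build an argument list for UnifiedGenotyper (hypothetical helper).

    Only the arguments discussed in the post are included.
    """
    if glm not in VALID_GLM:
        raise ValueError(f"glm must be one of {sorted(VALID_GLM)}")
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-R", reference,
        "-T", "UnifiedGenotyper",
        "-ploidy", str(ploidy),
        "-glm", glm,
    ]


# Illustrative tetraploid run, calling both SNPs and indels.
print(" ".join(unified_genotyper_args("reference.fasta", 4, "BOTH")))
```

Returning a list rather than a single string keeps the result directly usable with `subprocess.run` without shell quoting issues.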

## Current Limitations

We expect to remove many of these limitations over time, but for now please keep them in mind.

• Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.

• Some annotations do not work in non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. The annotations that do work and are supported in non-diploid use cases are QUAL, QD, SB, FS, AC, AF, and genotype-level annotations such as PL, AD and GT.

• The HaplotypeCaller and ReduceReads currently do not support non-diploid data.

• In theory you can use VQSR to filter non-diploid calls, but we currently have no experience with this and therefore cannot offer support or best practices on how to do it.

• For indels, at most 4 alleles can be genotyped at a site. This limit is not relevant for SNPs, but discovering or genotyping more than 4 indel alleles will not work; an arbitrary set of 4 alleles will be chosen at that site.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
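To make that last point concrete, here is a rough Python sketch comparing the expected read fraction of a variant carried on a single chromosome copy with a per-base sequencing error rate. The 1% error rate is an assumption chosen for illustration, not a GATK constant, and the function name is hypothetical.

```python
# Assumed per-base sequencing error rate, for illustration only.
ASSUMED_ERROR_RATE = 0.01


def expected_alt_fraction(copies_with_variant: int, ploidy: int) -> float:
    """Expected fraction of reads showing the variant, ignoring sampling noise."""
    if not 0 <= copies_with_variant <= ploidy:
        raise ValueError("copies_with_variant must be between 0 and ploidy")
    return copies_with_variant / ploidy


# A variant on exactly one chromosome copy, at increasing ploidy / pool size.
for ploidy in (2, 4, 20, 100):
    fraction = expected_alt_fraction(1, ploidy)
    print(
        f"ploidy={ploidy:>3}  expected alt fraction={fraction:.3f}  "
        f"ratio to assumed error rate={fraction / ASSUMED_ERROR_RATE:.1f}"
    )
```

Under this assumption, a single-copy variant at ploidy 100 is expected in about 1% of reads, the same fraction that errors alone would produce, whereas in a diploid it is expected in about half of them.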

Post edited by Geraldine_VdAuwera on

• Posts: 1 Member

As GATK2 can handle mitochondrial DNA, is there a recommended ploidy setting for human mitochondria? I understand that the number of mtDNA copies present in a cell can vary dramatically, but is there some sort of consensus value (e.g. some function of mean coverage)?

Thank you very much!

• Posts: 71 GATK Developer mod

We've experimented with values from 50 to 100, but we make no claims that these are optimal. A better number would probably be the ratio (mean coverage in the MT contig) / (mean coverage in somatic chromosomes).

• Posts: 3 Member

@delangel are there any other recommended settings for MT with GATK?

• Posts: 3 Member
edited September 2012

@delangel How does UG use this ploidy information for calling variants in MT? For SNPs, we don't expect more than 4 alleles (A, T, G, C) at any position. In our low-pass data we have 5-7X coverage overall, and ~700X in the case of mitochondria.

Post edited by sahiilseth on
• Posts: 71 GATK Developer mod

Its internal machinery needs to know the organism's ploidy (i.e. the number of chromosome copies inside) to work well (btw, the number of possible different alleles is not the same thing as ploidy). Given your coverage, I'd start with -ploidy 100 or so.
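For reference, the coverage-ratio heuristic suggested earlier in the thread can be sketched in a few lines of Python. The function name is hypothetical, and using the 5-7X overall coverage as a stand-in for mean coverage outside the MT contig is an assumption made here for illustration.

```python
def ploidy_from_coverage(mt_mean_coverage: float, nuclear_mean_coverage: float) -> int:
    """Estimate a -ploidy value for the MT contig from a coverage ratio.

    Hypothetical helper implementing the heuristic from this thread:
    (mean coverage in the MT contig) / (mean coverage elsewhere).
    """
    if mt_mean_coverage < 0 or nuclear_mean_coverage <= 0:
        raise ValueError("coverages must be positive")
    # Ploidy must be a whole number of at least 1.
    return max(1, round(mt_mean_coverage / nuclear_mean_coverage))


# Coverages quoted in the question: ~700X on MT, 5-7X overall.
for nuclear_coverage in (5, 7):
    print(nuclear_coverage, ploidy_from_coverage(700, nuclear_coverage))
```

With those numbers the estimate falls between 100 and 140, which is in line with starting at -ploidy 100 or so.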