The current GATK version is 3.6-0

# Can I use GATK on non-diploid organisms?

Posts: 71Dev mod
edited October 2014 in FAQs

In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

### Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
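As a rough illustration of the "one ploidy per run" rule, the sketch below only assembles GATK 3 HaplotypeCaller command lines for a male sample; it does not run anything. The jar, reference, BAM, and output names are placeholders, the contig names are assumed to be b37-style, and the pseudoautosomal regions are ignored for brevity.

```python
def hc_command(bam, reference, interval, ploidy, out_gvcf,
               jar="GenomeAnalysisTK.jar"):
    """Build one HaplotypeCaller command for a single interval and ploidy."""
    return [
        "java", "-jar", jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", interval,           # restrict this run to one region
        "-ploidy", str(ploidy),   # ploidy assumed for that region
        "-ERC", "GVCF",           # GVCF workflow, so results can be combined later
        "-o", out_gvcf,
    ]

# Hypothetical male sample: chromosome 1 is diploid, X and Y are treated as haploid.
region_ploidy = {"1": 2, "X": 1, "Y": 1}

commands = [
    hc_command("male.bam", "ref.fasta", contig, ploidy, f"male.{contig}.g.vcf")
    for contig, ploidy in region_ploidy.items()
]

for cmd in commands:
    print(" ".join(cmd))
```

Depending on the GATK 3 release, GVCF output may need additional indexing arguments, so check the documentation for your version before running commands like these.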

For earlier versions (all the way back to 2.0), the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.

### Cases where ploidy needs to be specified

1. Native variant calling in haploid or polyploid organisms.
2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
3. Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
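The pooled case is simple arithmetic, sketched here with a hypothetical helper (`pooled_ploidy` is not part of GATK; it only computes the number to pass to -ploidy).

```python
def pooled_ploidy(ploidy_per_individual, individuals_in_pool):
    """Return the value for -ploidy: chromosomes per barcoded sample."""
    if ploidy_per_individual < 1 or individuals_in_pool < 1:
        raise ValueError("both values must be positive integers")
    return ploidy_per_individual * individuals_in_pool

# A single haploid organism, not pooled.
print(pooled_ploidy(1, 1))

# Ten diploid individuals sharing one barcode.
print(pooled_ploidy(2, 10))
```

For the pool of ten diploid individuals, the value to pass would be 2 * 10 = 20.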

### Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
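To make the accuracy limitation concrete, here is a back-of-the-envelope sketch. It assumes a variant carried on exactly one chromosome copy, uniform sampling of reads, and a flat per-base error rate of 1%; all three are simplifying assumptions for illustration, not a description of GATK's internal model.

```python
def expected_reads(depth, ploidy, error_rate):
    """Expected reads supporting a single-copy variant vs. reads with an error."""
    signal = depth * (1.0 / ploidy)   # one variant copy out of `ploidy` copies
    noise = depth * error_rate        # reads with an erroneous base at the site
    return signal, noise

for ploidy in (2, 4, 20, 100):
    signal, noise = expected_reads(depth=100, ploidy=ploidy, error_rate=0.01)
    print(f"ploidy={ploidy:>3}  expected variant reads={signal:5.1f}  "
          f"expected error reads={noise:4.1f}")
```

Under these assumptions, at ploidy 100 the expected number of reads from a true single-copy variant is about the same as the expected number of erroneous reads, which is why such calls are hard to trust.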

Post edited by Geraldine_VdAuwera on

• Posts: 1Member

As GATK2 can handle Mitochondrial DNA, is there a recommended ploidy setting for human Mitochondria? I understand that mtDNA can vary dramatically in how many copies are present in a cell, but is there some sort of consensus value? (e.g. some sort of function of mean coverage)

Thank you very much!

• Posts: 71Dev mod

We've experimented with 50 to 100, but we make no optimality claims on that. A better number would probably be the ratio of (mean coverage in the MT contig) / (mean coverage in somatic chromosomes).
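A minimal sketch of that coverage-ratio heuristic, reading "somatic chromosomes" as the autosomes (an assumption). The function name and the coverage numbers are illustrative only, and the result is a starting point for -ploidy rather than a validated setting.

```python
def estimate_mt_ploidy(mt_mean_coverage, autosomal_mean_coverage):
    """Estimate an MT ploidy setting as the ratio of mean coverages."""
    if autosomal_mean_coverage <= 0:
        raise ValueError("autosomal mean coverage must be positive")
    # Round to a whole number and never go below 1.
    return max(1, round(mt_mean_coverage / autosomal_mean_coverage))

# Illustrative low-pass sample: ~700X on MT, ~7X on the autosomes.
print(estimate_mt_ploidy(700, 7))
```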

• Posts: 3Member

@delangel are there any other recommended settings for MT with GATK?

• Posts: 3Member
edited September 2012

@delangel
How does UG use this ploidy information for calling variants in MT? For SNPs at any position we don't expect more than 4 alleles (A, T, G, C).
In our low-pass data we have 5-7X coverage overall, and ~700X in case of mitochondria.

Post edited by sahiilseth on
• Posts: 71Dev mod

Its internal machinery needs to know the organism ploidy (i.e. the number of chromosomes inside) to work well (btw, the number of possible different alleles is not the same thing as ploidy). Given your coverage, I'd start with -ploidy 100 or so.

What if I want to do multi-sample SNP calling using the unified genotyper on a mixture of haploid and diploid organisms? I have many yeast genomes where most are haploid, but a few are diploid. Will I have to call variants on the two groups separately?

@kjclowers

Hi,

You will need to run the UnifiedGenotyper separately for the haploid and diploid genomes. You have to specify which ploidy the UG should expect for each group in your command line, so unfortunately you cannot run a mixture of ploidies.

-Sheila
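One way to organize the separate runs is to group samples by ploidy first and then build one UnifiedGenotyper command per group. This is only a sketch: the strain names, BAM names, reference, and jar path are placeholders, and the commands are printed rather than executed.

```python
from collections import defaultdict

def group_by_ploidy(sample_ploidy):
    """Map each ploidy to the sorted list of samples that have it."""
    groups = defaultdict(list)
    for sample, ploidy in sorted(sample_ploidy.items()):
        groups[ploidy].append(sample)
    return dict(groups)

def ug_command(samples, ploidy, reference, out_vcf, jar="GenomeAnalysisTK.jar"):
    """Build one multi-sample UnifiedGenotyper command for a single ploidy."""
    cmd = ["java", "-jar", jar, "-T", "UnifiedGenotyper",
           "-R", reference, "-ploidy", str(ploidy), "-o", out_vcf]
    for sample in samples:
        cmd += ["-I", f"{sample}.bam"]  # assumes one BAM per sample
    return cmd

# Hypothetical yeast strains: mostly haploid, one diploid.
groups = group_by_ploidy({"strainA": 1, "strainB": 1, "strainC": 2})

commands = [
    ug_command(samples, ploidy, "yeast_ref.fasta", f"ploidy{ploidy}.vcf")
    for ploidy, samples in sorted(groups.items())
]

for cmd in commands:
    print(" ".join(cmd))
```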

• United KingdomPosts: 400Member ✭✭✭

@delangel Do you throw your MT (and Y and non-PAR X) variants into VQSR along with your autosomal variants? I'm worried this will bias depth-dependent annotations. I'd rather filter these variants separately (at least MT and Y), but I would probably have to use hard filtering, since the number of variants is probably insufficient for VQSR and because not all SNP truth/training sets contain all chromosomes:

- hapmap_3.3.b37.vcf.gz: 1-22, X, Y, MT
- 1000G_omni2.5.b37.vcf.gz: 1-22, X, Y
- 1000G_phase1.snps.high_confidence.b37.vcf.gz: 1-22, X
- dbsnp_138.b37.vcf.gz: 1-22, X, Y


I would be very happy to get your insight on this, because the supplementary material to projects such as 1000G phase1 and GoNL is a bit scattered in terms of variant calling and filtering of mtDNA.

Hi Tommy @tommycarstensen

I'm not aware that we do any special handling of the non-autosomal components in the production pipeline -- I think the assumption is that there are few enough of them to not affect the rest. But it would be interesting to see a formal comparison. I sure can't imagine that you could get VQSR to run on them alone.

PS: FYI Guillermo (@delangel) has moved on to another job so I don't know if he still gets GATK forum emails -- best not count on it.

Geraldine Van der Auwera, PhD

• United KingdomPosts: 400Member ✭✭✭

@Geraldine_VdAuwera thanks for this. If I do any comparisons, then you will be the second to know. I'm still making changes to my analysis plan. I'm worried that Y and non-PAR X will have true positives filtered out by VQSR, because they are haploid in males and VQSR uses a few depth-related annotations. I'll update this thread in time.

• United KingdomPosts: 400Member ✭✭✭

@Geraldine_VdAuwera Just as an aside: the memory use of UG 3.3 seems to explode, and it seems to stall or walk only slowly, when I run it with -ploidy 1 on chromY and the non-PARs of chromX for male samples. I'm switching to diploidy for the whole genome for samples of both sexes. I am, however, happy to provide the command line etc. if you are eager to troubleshoot. Apologies for the many posts over the past few weeks and months. I've been testing out a lot of things.

@tommycarstensen To be honest we have zero resources available for UG troubleshooting/improvements -- I'm afraid you're on your own on this one.

No worries, your questions have helped us nail down a few issues.

Geraldine Van der Auwera, PhD

• San DiegoPosts: 9Member

Hi Geraldine,

I have a question about using UG or HaplotypeCaller with a haploid organism where I have more than one clone in a sample. We generally use UG with ploidy set to two (the default) and assume het variants are mixed reads from different clones. My worry is that UG expects a certain number of hets to be called. Is this true? I've thought about setting ploidy to e.g. 4 and then saying we can only detect clones present in at least 25% of the sample. What are your thoughts? If you run UG with ploidy=1, will it allow for het calling? Should I try MuTect instead? Thanks!