Any ploidy goes!

Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie
edited December 2014 in Announcements

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.
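For example, a minimal haploid run might look like this (the paths and file names are placeholders):

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -ploidy 1 -o sample.raw.vcf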

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

  • When running these tools, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy (see the example command after this list). As usual, the default ploidy is 2.

  • The system does not support mixed ploidy across samples or across positions. An error will be thrown if you attempt to genotype GVCFs that contain a mixture of ploidies, or any genotype whose ploidy does not match the -ploidy argument.
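For example, to genotype GVCFs that were all produced with a ploidy of 4, the command would look roughly like this (file names are placeholders):

java -Xmx4g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta -V sample1.g.vcf -V sample2.g.vcf -ploidy 4 -o combined.vcf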

LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.


Comments

  • charlesbaudo (Missouri) Member
    edited August 2014

    I'm in the process of comparing variant calls made by UnifiedGenotyper and this nightly build of HaplotypeCaller on 7 samples of a haploid organism.

    Here is my UnifiedGenotyper command: java -Xmx6g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R /path/ref.fasta -I /path/7_sample_merge.bam -glm BOTH -ploidy 7 -o 7_sample.raw.vcf

    and here is my HaplotypeCaller command: java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /path/ref.fasta -I /path/sample1.bam -I /path/sample2.bam -I /path/sample3.bam -I /path/sample4.bam -I /path/sample5.bam -I /path/sample6.bam -I /path/sample7.bam -stand_call_conf 30 -stand_emit_conf 10 -ploidy 7 -o 7_samples.raw.vcf

    The terminal printout for UG shows the number of processed active regions increasing, while the one for HC does not. I see the same behavior with a single BAM input and ploidy set to 1, and with two BAM inputs and ploidy set to 2. Disclaimer: I have never used HaplotypeCaller before, as I work solely on a haploid organism, nor have I tried HC on diploid data. Apologies if this is the intended behavior, but I thought I should ask about it.

    UG:

    INFO 09:47:01,777 ProgressMeter - Supercontig_6:765865 3.27e+07 48.5 h 88.8 m 79.7% 60.8 h 12.4 h

    INFO 09:48:01,782 ProgressMeter - Supercontig_6:776249 3.28e+07 48.5 h 88.8 m 79.7% 60.8 h 12.3 h

    Nightly HC:

    INFO 09:55:14,091 ProgressMeter - Supercontig_1:5972736 0.0 19.4 h 15250.3 w 14.5% 5.6 d 4.8 d

    INFO 09:56:14,095 ProgressMeter - Supercontig_1:5976722 0.0 19.5 h 15250.3 w 14.5% 5.6 d 4.8 d

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Hi @charlesbaudo,

    First, thanks for being willing to give this a try!

    The difference in the ProgressMeter behavior that you're seeing is expected. It's due to fundamental differences in how UG and HC traverse the data. UG traverses the genome one individual position at a time, so it's easier to track its progress, and it goes through each unit of data much faster than HC. In contrast, HC traverses the data by ActiveRegions, which are stretches of genome that can go up to several hundred bases (or more if you set the max higher). Each unit takes much longer to process since it contains more data, and so you may not see the count go up for a while (although it should move eventually, unless something is broken that I'm not aware of).

    That said, I realize I wasn't clear about the usage of the ploidy argument. If you have 7 samples that are clearly separated with distinct read groups (as opposed to pooled samples, which are not individually identified), you need to set the ploidy value to the actual ploidy of a single sample, not the aggregate ploidy over all samples. So here you would set the ploidy to 1 for your haploid samples; otherwise you're asking HC to consider each sample a heptaploid. This should also speed up the process, since HC will have considerably less work to do.
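    For example, your HC command above could be adjusted along these lines (same inputs, but with the per-sample ploidy of 1):

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /path/ref.fasta -I /path/sample1.bam -I /path/sample2.bam -I /path/sample3.bam -I /path/sample4.bam -I /path/sample5.bam -I /path/sample6.bam -I /path/sample7.bam -stand_call_conf 30 -stand_emit_conf 10 -ploidy 1 -o 7_samples.raw.vcf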

    I might also recommend doing a single-sample to single-sample comparison first before diving into multi-sample comparisons, but that's up to you.

  • charlesbaudo (Missouri) Member

    Thank you for your quick response, @Geraldine_VdAuwera. To clarify: if I have merged those same 7 BAM files, should I still set the ploidy to 1, as long as they have distinct read groups? And does this recommendation apply to both HC and UG? Thank you.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    As long as the samples have distinct read groups (specifically, distinct SM tags) it makes no difference whether they are in the same file or in separate files. The GATK engine merges all the data when it reads in the file contents, and distinguishes samples by SM tags. This does apply to both UG and HC, yes.
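    For illustration, a read group header line that distinguishes a sample by its SM tag might look like this (fields are tab-separated and the values are just placeholders):

    @RG    ID:lane1.sample1    PL:ILLUMINA    LB:lib1    SM:sample1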

  • charlesbaudo (Missouri) Member

    Thank you for the clarification. Cheers.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Updated the text to reflect added support for GVCF tools, with stated limitations.

  • travcollier (UC Davis) Member
    edited October 2014

    The docs for HaplotypeCaller (from the nightly build 2014-10-14-gf24cf57) still say that it only supports a ploidy value of 2 under the 'Caveats' section.
    That should probably get updated too.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Indeed, thanks for pointing it out.

  • Does GenotypeGVCFs in the latest nightly build have features beyond v3.3-0, with respect to mixed ploidy across samples/positions? Thanks.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Actually, version 3.3 supports mixed ploidy; this post refers to 3.2. I'll add a note. Meanwhile, see https://www.broadinstitute.org/gatk/blog?id=4741 for the mixed ploidy announcement.

  • sepid (UCLA) Member

    May I ask whether you know of any polyploid (fragment) dataset that is publicly available?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    I'm afraid not, @sepid, sorry. Maybe SeqAnswers or BioStars would know.

  • sepid (UCLA) Member

    I have a question about how the GATK polyploidy and mixed-ploidy algorithm works. Have you published it, or have you released the details of the algorithm? I very much appreciate your response!

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    It hasn't been published yet, sorry. But you can view the code here if you want.

  • Katie (United States) Member

    I have a question about specifying ploidy for population sequence data. I am conducting a population sequencing study on a vector-borne bacterium, so each genomic library I have is the bacterial population infecting a single vector, but it constitutes a pool of an unknown number (likely 1-7) of different bacterial clones. I am planning to use HaplotypeCaller to call variants, but I am wondering how to deal with ploidy when I have an unknown ploidy (or number of co-infecting bacterial strains) present within each genomic library. Should I set the ploidy to 7 (the maximum we are likely to see in nature) for all samples and call them all together?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    @Katie There is no absolute answer to your question; it's going to be a tradeoff between sensitivity (aim to detect every variant, even if present only as a minority fraction) and specificity (aim to reduce false positives resulting from sequencing noise/artifacts). If you want to maximize sensitivity, you'll need to set the ploidy to the highest likely number of clones in your population.

  • Katie (United States) Member

    Okay, thanks for getting back to me so quickly!

  • Katie (United States) Member

    One more question about setting ploidy for a haploid sample with an unknown number of clones present at unknown proportions (i.e. it is unlikely they are present at equal proportions). If I set the ploidy to 10 because that's the highest likely number of clones in my population, does the variant detecting algorithm assume that each clone should be present at 10%? In other words, am I then limiting my sensitivity to any variant found in > 10% of the total population? Or instead, does the ploidy argument limit the number of unique alleles at a locus? HaplotypeCaller is very slow with high ploidy, but I would like to maintain high sensitivity to minority variants.
    Thanks for your help!

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    @Katie Your first supposition is correct -- and yes that would limit your sensitivity to minority variants. Unfortunately the HC was not designed to handle the use case of indeterminate pools.

  • Katie (United States) Member

    Thank you for your help on this. It seems like variant calling may be a two-step process:
    (1) to identify variants informative of differentiation between pooled samples, set PLOIDY=1 (call the consensus allele at each locus) and run HaplotypeCaller on all BAM files, which may or may not contain mixed samples.
    (2) to identify within-host polymorphisms, set PLOIDY=100 and run HaplotypeCaller on each BAM file individually. (Calling variants on all samples at once with a high PLOIDY is computationally infeasible.)

    In this case, can I leave the HAPLOTYPE argument at its default value, because this will not have as strong an effect on sensitivity to minority variants?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    That's an interesting strategy, worth exploring.

    Using ploidy 1 seems a bit radical for the first step -- this will favor sites where your population is very homogeneous. Sites with a lot of heterogeneity may still be called, but with low confidence, so you'd have to make sure to lower the emit thresholds. Or use the GVCF mode of HC, since it sounds like what you'll be looking for are sites with a low probability of being hom-ref, and potentially multiple alternate alleles. You can then select these out to serve as a list of interesting sites.
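    A minimal sketch of that first pass in GVCF mode (assuming haploid calls per sample; file names are placeholders) might look like this:

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample1.bam -ploidy 1 --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o sample1.g.vcf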

    Assuming you plan to run step 2 on just the sites called in step 1, note that you can do this by passing in your VCF using -L, and adding some padding (optional but recommended) using -ip 50.
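    For that second step, restricted to the sites of interest, the command might then look roughly like this (the interval file name and the ploidy value are placeholders):

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample1.bam -ploidy 20 -L step1_sites.vcf -ip 50 -o sample1.within_host.vcf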

    But ploidy 100 may prove challenging in terms of performance. The number of calculations needed to evaluate all possible allele combinations is astronomical. In the future this may become more feasible if we can triage the combinations that need to be calculated, but with the current model that is not done. I'm afraid your job might run forever... You may have to compromise and use a lower ploidy even for that second step.

  • Katie (United States) Member

    Thank you. Using the GVCF mode makes a lot more sense, so that I can do single-sample variant calling first and then conduct multi-sample variant discovery to identify different subsets of variants that inform (1) between-sample variation and (2) within-host variants.

    However, I ran into trouble using the GVCF mode of HC using the following input:

    java -d64 -Xmx20g -jar "/home/bioinfo/software/GATK-3.3/target/GenomeAnalysisTK.jar" \
    -T HaplotypeCaller \
    -R $REF \
    -I $BAM \
    --emitRefConfidence GVCF \
    -variant_index_type LINEAR \
    -variant_index_parameter 128000 \
    -nct 24 \
    --max_alternate_alleles 3 \
    -L $INTERVAL \
    --sample_ploidy 30 \
    -o ${OUTDIR}${BASE}${INT}"_"${PLOIDY}".g.vcf"

    ERROR MESSAGE: you have hit a current limitation of the GVCF output model that cannot handle ploidies larger than 20 , please let the GATK team about it: 30

    When I set the ploidy at 20, I got the error:
    A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec)
    ERROR MESSAGE: 20

    I am using GATK version 3.3-0. Is this an issue with the version? Thank you for your help!

  • Sheila (Broad Institute) Member, Broadie, Moderator

    @Katie
    Hi,

    It is possible it is an issue with the version. I forget in which exact version this was introduced, but if you use the latest version you should have no problem.

    -Sheila

  • Katie (United States) Member

    Great, issue resolved with the latest version. Thank you!

  • Katie (United States) Member

    One more ploidy question: the GenotypeGVCFs tool returns no output when I test it on a list of gVCFs called with a ploidy of 30. I have tested several iterations and get no error messages, but the output file only includes the VCF header. The gVCF files look reasonable.

    java -d64 -Xmx20g -jar "/home/ksw9/bin/GenomeAnalysisTK.jar" -T GenotypeGVCFs -R $REF --sample_ploidy 30 -V gVCFTest.list -o test.raw.vcf

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Are you sure the run completed? If yes, try again with a lower emit confidence threshold. Check your GVCFs to see what the range of QUALs of your calls is.

    Note that you don't need to specify ploidy to GenotypeGVCFs, it will infer it from the genotypes in your GVCFs.
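    For example, something along these lines, with the emit threshold dropped and no explicit ploidy (the threshold value is just illustrative; check the exact argument name against the GenotypeGVCFs documentation for your version):

    java -d64 -Xmx20g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $REF -V gVCFTest.list -stand_emit_conf 0 -o test.raw.vcf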

  • Katie (United States) Member

    Thank you. Yes, the issue is that if I run HaplotypeCaller with PLOIDY=30, all GQ scores are extremely low, due to low confidence in calling 30 alleles. When I set a lower ploidy, the GQ scores increase (slightly) and it is possible to use the GenotypeGVCFs tool. Even though the GQ scores are low, this will help me identify sites that show evidence of within-sample polymorphism.
    Thanks for the help!

  • Rebs, Member

    Hi, I have a question related to ploidy in UnifiedGenotyper. I am using GATK version 3.8.1 and working with P. falciparum.
    I have a file with samples that were pooled prior to sequencing (the individual samples are not identified). The number of samples is known, but the number of parasites per sample is unknown. Also, the parasite has stages where it is haploid and others where it is diploid. Which ploidy should I set when calling variants?
    Thanks!

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    @Rebs Based on what you describe, there is no "correct" setting for the ploidy. One possible approach is to choose the minimum allele fraction you aim to be able to detect, and set the ploidy accordingly (for example, aiming to detect alleles down to roughly 5% of the pool would correspond to a ploidy of about 20, i.e. 1/0.05).

    Alternatively you could try using the Mutect2 somatic caller (in tumor-only mode) which does not make ploidy assumptions.
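    If you do want to try that route, a tumor-only MuTect2 run in GATK 3.x looks roughly like the sketch below; the -I:tumor tag is the key part, but please double-check the exact arguments against the 3.8 documentation, as this is an assumption rather than a verified command.

    java -jar GenomeAnalysisTK.jar -T MuTect2 -R reference.fasta -I:tumor pooled_sample.bam -o pooled_sample.mutect2.vcf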

  • Rebs, Member

    Hi @Geraldine_VdAuwera
    Thanks for your reply.

    What do you mean by allele fraction?
    Also, I performed variant calling trying different ploidies, and when setting it to 1, it detects fewer SNPs...

    Regarding Mutect2, I would prefer to stick with UnifiedGenotyper. It was recommended by some colleagues of mine who also work with the same organism (FYI, they used ploidy 2, but their conditions were different from mine).

    Thanks!

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Imagine you have a pool of 10 diploid individuals and one of them has a heterozygous SNP. Assuming perfect sequence quality, what you expect to see in the data is that the variant allele will be represented by one out of every 20 reads -- an allele fraction of 0.05. Considering the regular occurrence of technical artifacts, we're not going to want to trust an allele that's represented by very few reads. So let's say we're arbitrarily setting that limit at 5 (there are better ways to do this; this is a strawman) -- that means you'll need to have ~100x coverage at this site to make a confident call.

    Now in your case you already have the data, so you have to work backward from however much coverage you do have. Let's say it's ~50x coverage -- that means you can only really expect to make confident calls for variants with an observed allele fraction of 0.1, i.e. one hom-alt among 10 diploid samples or one heterozygote among 5 diploid samples.

    Does that make sense?

    Re: what you see with ploidy = 1, that's expected. You're throwing out anything that doesn't look like a hom-alt call.
