Holiday Notice:
The Frontline Support team will be offline February 18 for President's Day but will be back February 19th. Thank you for your patience as we get to all of your questions!

HaplotypeCaller pooled sequence problem

Hi,

I have a number of samples that consist of multiple individuals from the same population pooled together, and have been truing to use HaplotypeCaller to call the variants. I have set the (ploidy to 2 * number of individuals) but keep getting the same or similar error message, after running for several hours or days:

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: the combination of ploidy (180) and number of alleles (9) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus
ERROR ------------------------------------------------------------------------------------------

and I'm not sure what I can do to rectify it... Obviously I can't limit the ploidy, it is what it is, and I thought that HC only allows a maximum of six alleles anyway?

My code is below, and any help would be appreciated.

java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller
-nct 6 \
-R ~/my_ref_sequence \
--intervals ~/my_intervals_file \
-ploidy 180 \
-log my_log_file \
-I ~/my_input_bam \
-o ~/my_output_vcf

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I'm not sure why it's using more alt alleles than the default max. Try setting it to something lower explicitly with the -maxAltAlleles argument.

  • andyktandykt Member

    Thanks, I'll try that.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @andykt said:
    I have a number of samples that consist of multiple individuals from the same population pooled together, and have been truing to use HaplotypeCaller to call the variants. I have set the (ploidy to 2 * number of individuals)

    Hi @andykt. I'm a user of the GATK like you. I could be missing something here, but why don't you use a ploidy of 2? Are these humans with some serious chromosomal disorders? I've heard of trisomy, but 180-somy? Maybe I'm just completely misunderstanding what you are trying to achieve. It would not be the first time.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @tommycarstensen This is an experimental design called sample pooling where samples are pooled at the library stage (I believe) and sequenced together as a group (without being individually barcoded). This allows you to get more samples sequenced for cheaper and makes sense if you're more interested in what variants are present in a population than precisely what's in individuals. The effective ploidy of a pool is N samples multiplied by the actual ploidy of individuals.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Ahhh, thanks for enlightening me. I didn't know about this type of experimental setup. Very cool. Thanks.

  • andyktandykt Member

    Thanks Geraldine, that's right - in this case @tommycarstensen, I'm interested in knowing allele frequencies within different populations and species of rodents, rather than individuals.

  • andyktandykt Member

    Back to my original problem @Geraldine_VdAuwera, I've tried running HC again after explicitly specifying a -maxAltAlleles of 5, but I am still getting an error regarding number of alleles (7).

    Any other ideas?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @andykt If you specify 5 alternate alleles, you effectively consider 7 alleles since you also have REF and the symbolic NON-REF alleles included. I'm afraid you may need to drop maxAltAlleles even further -- the program wasn't designed with such a large pool in mind. That will limit your ability to discover minority alleles of course. Or you drop the ploidy, which will allow you to find minority alleles, but lower your sensitivity to rare variants.

  • andyktandykt Member

    Thanks @Geraldine_VdAuwera , if I cold just ask for some further clarification - first, I don't think I can drop the ploidy, if I want accurate allele frequencies I think I need to provide HC with an accurate number of individuals? Second,regarding the max number of alleles: I'm only really interested in SNPs, therefore there will generally only ever be a maximum of three alleles (one in the ref sequence, and possibly two within the population I'm sequencing if both differ from the ref). So if I drop the max alleles to 3, will that affect SNP identification?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    For SNPs that should be fine, sure.

  • andyktandykt Member

    Hi again @Geraldine_VdAuwera, sorry to bother you again... I'm afraid I'm still getting the same problem, I've now set max alleles to 3 and yet still get the following:

    ERROR MESSAGE: the combination of ploidy (120) and number of alleles (7) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus

    So I guess either the max alleles argument doesn't work, or HC can't yet handle pooled populations (and high ploidy) as well as you guys thought? Would I be better using Unified Genotyper if so?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @andykt,

    Can you please confirm that you ran with max alt alleles set to 3 and it quoted number of alleles as 7? That is not the expected behavior. If that's what's happening, we'll need a snippet of data from you to reproduce the error and debug locally.

    But the wider problem will probably remain -- I'm afraid the ploidies you're after are probably beyond what HC can handle. UG was engineered to some extent to handle pooled experiments (there is a manuscript in prep somewhere describing those capabilities) but HC was never tested in those conditions. The generalization of the ploidy model in HC was mainly put in to please the folks who work on non-diploid organisms, but the devs didn't have pooled experiments in mind, I think, so they never tested ploidies on the order of what you're looking at. In this case I might make an exception to my usual rule and recommend trying UG, indeed... No guarantees, but it's worth a shot while we look into that max alt alleles issue.

  • hin175hin175 ParisMember
    edited February 2015

    Hi,
    facing the same issue with 3 BAM files from 70 pooled diploid individuals, I roughly calculated (using the apparent 2147483647 barrier) that with 5 allies, the max ploidy that can be used is 13... Should I use this ploidy value (considering I mainly interested in fixed SNPs among populations), or should I use UG (but I'm not sure UG can handle such large possible genotypes : 7.17 E97) ?
    Thanks in advance

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @hin175 having looked into this a little more, I can tell you that the barrier you folks are running into is an arbitrary number that corresponds to the largest value allowed by the Java language for integer numbers. This limit applies equally to UG, so it seems it won't actually matter whether you use HC or UG. So you should lower the ploidy and/or the number of alt alleles depending on what you care most about. Maybe try a few different combinations and see what answers your questions better.

  • hin175hin175 ParisMember

    Thank you for your answer. I will use a ploidy value of 13, and considering the rare variants will (almost) not be called.

  • andyktandykt Member

    Thanks.Unfortunately I don't think reducing the ploidy is an option for me - I'm interested in allele frequencies, and the ploidy is what it is. I think in this case I'll need to use VarScan or FreeBayes rather than GATK, but thanks for your advice. Hopefully it will be possible with you guys in the future.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Keep in mind that when you reduce the ploidy by a known factor, you can still recalculate the allele frequencies quite precisely. The drawback is that you lose sensitivity to very rare events, but that is a common issue with pooled experiments.

    In any case, good luck!

  • rstamrstam MunichMember

    Hi, I have the same problem.

    I tried to limit the numbers of alternate alleles:

    INFO 14:03:16,268 HelpFormatter - Program Args: -T HaplotypeCaller --sample_ploidy 20 --max_alternate_alleles 4 -nct 24 -R Genome.fa -I Sample07.bwa.bam.realigned.bam -L Sample07.bwa.bam.interval_list -o Raw_Variants_Sample07.bwa.bam.vcf

    Somehow it seems to work, kind of. I get these warnings in the log:

    INFO 14:14:38,131 ProgressMeter - scaffold5:164963 61814.0 12.0 m 3.2 h 8.0% 2.5 h 2.3 h
    WARN 14:14:41,755 ExactAFCalculator - this tool is currently set to genotype at most 4 alternate alleles in a given context, but the context at scaffold5:162205 has 5 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

    But then later on, it breaks on a different position:

    ERROR MESSAGE: the combination of ploidy (20) and number of alleles (27) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus

    So this error shows, even when setting max alt alleles to 4.
    Does this mean that GATK still tries to calculate all possible alternate alleles and only after that applies the set limit? That wouldn't make sense, would it?

    Thanks,

    Remco

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Yes, until the latest version (3.7) it still tried to calculate everything -- the limit only affected final output. That was a partial solution that addressed a different if related problem.

    In the latest version we have a better solution -- check out the 3.7 version highlights for a detailed explanation.
  • ambarrioambarrio Member

    Dear @Geraldine_VdAuwera ,

    This development in 3.7 is really exciting for those of us still working with pool-seq designs, very useful. If I follow your reasoning on the Release Notes: https://software.broadinstitute.org/gatk/blog?id=8692, --max_genotype_count will cull now the combination of --max_alternate_alleles and --sample_ploidy.

    Just to fill my curiosity and everyone's else's here. Could you explain how do you calculate the maximum allele count in your example from a ploidy value and genotype set to 1024?

    https://software.broadinstitute.org/gatk/blog?id=8692

    For example, with ploidy 18 and maximum genotype count set to 1024 (the current, arbitrary default value, but definitely reasonable in > most cases), the maximum allele count is 3 (alt allele count 2), potentially much lower than the -maxAltAlleles you requested.

    Is there any formula? I'm not sure if I got it correctly.

    Thanks a lot,
    álvaro

    Issue · Github
    by shlee

    Issue Number
    1745
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited February 2017

    Hi @ambarrio,

    Geraldine is away on a much deserved vacation so I've asked one of our developers to help with your question. Our developer says the following.


    The number of unphased Genotypes (NUG) give ploidy p and allele number a (including reference) can be obtained recursively like so:

    NUG(p, a) = NUG(p, a-1) + NUG(p-1,a) 
    
    Where NUG(0,n) = 1 and NUG(n,1) = 0.
    

    And this seems to be equivalent to

    NUG(p,a) = choose(p+a-1, a-1).
    

    So then we calculate the maximum allele so that NUG(p,a) < the limit.

    a' = max_a NUG(p,a) <= limit.
    

    I hope this answers your question.

  • mjtivmjtiv Newark, DEMember

    Question about ploidy setting? So, lets say you have a pool of 200 samples (samples diploid, so 400 ploidy setting---not running with this ploidy setting) and you are interested in common variants (not super rare variants) for future breeding selection. If you allow the program to run with the standard setting of 2, is that a good or bad idea? However, if you increase ploidy setting to lets say 50, how do you re-caclculate the frequency of alleles called for that setting? Trying to optimize speed/some sensitivity for running such a large pooled population.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mjtiv
    Hi,

    I hope this thread will help.

    -Sheila

  • mjtivmjtiv Newark, DEMember

    @Sheila thank you for the link! However, this now brings up another question based on @Geraldine_VdAuwera (https://gatkforums.broadinstitute.org/dsde/discussion/8407/analysis-of-genome-evolution-over-time-with-gatk). Geralidine mentions that the ploidy level of 21 is HaplotypeCallers prior efficiency maximum. Did this get increased in newer version of GATK4.0? I liked your comments suggestion of 100N (capture 1% level).

  • mjtivmjtiv Newark, DEMember

    Note: The data set I am using is huge and we had to increase heapsize in Picard for mark duplicates to 240GB otherwise it crashed badly.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @mjtiv,

    I'm not aware of any changes to making HaplotypeCaller more efficient for higher ploidies. Instead, for such pooled experiments, perhaps you may consider Mutect2 instead.

Sign In or Register to comment.