Max ploidy for pooled seq HC/UG runs?

Is there an upper limit on cumulative -ploidy for HC or UG runs? HC threw an error when the cumulative ploidy exceeded 100,
and UG's runtime keeps growing once my cumulative ploidy exceeds ~100.
If this is known behavior, please point me to the relevant threads. Be nice, please :)
Generally speaking, how would you approach the analysis of multiple pools of 24 diploid individuals (i.e. ploidy per pool = 24 for organelles and 48 for autosomes)?

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi there,

    There is no hard upper limit at the programmatic level, but as ploidy increases so does the complexity of the calculations involved. We have not tested such high ploidies, so we cannot easily predict at what point you would expect to run into problems. It may also be dependent on your computing infrastructure since performance may be an issue.
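    A rough sketch of why the calculations blow up with ploidy (a back-of-the-envelope illustration, not taken from GATK documentation): the number of distinct unordered genotypes the caller must score for one sample, which is also the length of that sample's PL array, is the multiset coefficient C(ploidy + alleles − 1, ploidy).

    ```python
    from math import comb

    def num_genotypes(ploidy: int, num_alleles: int) -> int:
        # Count of unordered genotypes (= PL array length) for one sample:
        # the multiset coefficient C(ploidy + alleles - 1, ploidy).
        return comb(ploidy + num_alleles - 1, ploidy)

    print(num_genotypes(2, 2))   # diploid, biallelic -> 3
    print(num_genotypes(48, 2))  # ploidy 48, biallelic -> 49
    print(num_genotypes(48, 4))  # ploidy 48, 4 alleles -> 20825
    ```

    Biallelic sites scale only linearly with ploidy, but multi-allelic sites grow polynomially, and the likelihood computations over all those genotype configurations grow faster still.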

    But we can try to figure out what is going wrong in specific cases. Can you tell me what error you got with ploidy > 100?

  • KNS (Member)

    Well, truth be told, I couldn't reproduce the same error for ploidies >100 today, for reasons that aren't clear to me.
    BUT I could reproduce a ploidy error by submitting the following command:

    java -Xmx20g -jar GenomeAnalysisTK.jar \
    -R some/ref/name.fasta \
    -T HaplotypeCaller \
    -I /some/filtered/dupmarked.bam \
    --sample_ploidy 24 \
    --min_base_quality_score 20 \
    --maxNumHaplotypesInPopulation 24 \
    --min_mapping_quality_score 20 \
    -dcov 10000 \
    --max_alternate_alleles 2 \
    -ERC BP_RESOLUTION \
    -L some/intervals_test.list \
    --doNotRunPhysicalPhasing \
    -nct 6 \
    -pairHMM VECTOR_LOGLESS_CACHING \
    -o /some/output/name.vcf

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalArgumentException: you have hit a current limitation of the GVCF output model that cannot handle ploidies larger than 20 , please let the GATK team about it: 24
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.getIndelPLs(ReferenceConfidenceModel.java:269)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calculateRefConfidence(ReferenceConfidenceModel.java:227)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.referenceModelForNoVariation(HaplotypeCaller.java:1209)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1010)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: you have hit a current limitation of the GVCF output model that cannot handle ploidies larger than 20 , please let the GATK team about it: 24
    ERROR ------------------------------------------------------------------------------------------
  • KNS (Member)
    edited November 2014

    The massive increase in computational load incurred by increasing the ploidy level is becoming prohibitive for me. But of course I am required to produce strong, clean raw data sets.
    With a ploidy of 2, read coverage around/above 100, and mmq/mbq > 20, sites with 2 or so alternative reads would simply be considered invariant (which I believe is true; sequencing errors and such could easily produce 2 different base calls). However, increasing the ploidy to the "true" level would identify (probably low-qual) variable sites. With multiple pools in the same analysis, these might even turn into qual sites. And of course with pooled data, I have the tabulated read counts as proxies for my pool-wise allele frequencies.
    So I wonder: would I be erring on the (for me) safer, i.e. conservative, side and underestimating variation if I ran analyses with a lower-than-true ploidy level? And if so, what would be a good way of capping ploidy (e.g. by average read coverage or similar)?
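    One way to sketch the capping question (an illustrative heuristic only, not a GATK recommendation; the function name and the default ceiling of 20 are assumptions): allele frequencies finer than roughly 1/coverage cannot be resolved anyway, so one defensible rule of thumb is to cap the effective ploidy at the average read depth, and additionally at any hard ceiling the tool imposes.

    ```python
    def effective_ploidy(true_ploidy: int, avg_coverage: float,
                         hard_cap: int = 20) -> int:
        # Illustrative heuristic: cap ploidy at the average read depth
        # (finer allele-frequency resolution isn't supported by the data)
        # and at a hard ceiling such as the GVCF model's ploidy-20 limit.
        return max(1, min(true_ploidy, int(avg_coverage), hard_cap))

    print(effective_ploidy(48, 100))  # -> 20 (hard cap binds)
    print(effective_ploidy(48, 12))   # -> 12 (coverage binds)
    print(effective_ploidy(2, 100))   # -> 2  (true ploidy binds)
    ```

    Under such a cap the analysis would tend toward the conservative side, since low-frequency alleles within a pool get folded into the nearest representable frequency.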

  • KNS (Member)

    Thanks a lot, will do!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @KNS I looked into this a little further and found that there is a hardcoded ploidy limitation that specifically applies to the reference confidence model (I learn something new every day). The good news is that the developers may be able to adapt the model to overcome this limitation. Could you please upload a snippet of your file that reproduces the error? It's helpful for them to have a real case to develop the fix with. Instructions are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

  • KNS (Member)

    Well, for starters, the error can be reproduced with the example data provided with GATK v3.3-0. So I wonder whether it's a fairly simple hardcoded limitation somewhere in the code rather than a peculiarity of our data? Submitting something like this produces the error:

    [xxx GenomeAnalysisTK-3.3-0]$ java -Xmx20g -jar GenomeAnalysisTK.jar \

    -R /xxx/GenomeAnalysisTK-3.3-0/resources/exampleFASTA.fasta \
    -T HaplotypeCaller \
    -I /xxx/GenomeAnalysisTK-3.3-0/resources/exampleBAM.bam \
    --sample_ploidy 48 \
    --min_base_quality_score 20 \
    --maxNumHaplotypesInPopulation 24 \
    --min_mapping_quality_score 20 \
    -dcov 10000 \
    --max_alternate_alleles 2 \
    -ERC BP_RESOLUTION \
    --doNotRunPhysicalPhasing \
    -nct 6 \
    -pairHMM VECTOR_LOGLESS_CACHING \
    -o /xxx/random.vcf

    INFO 08:11:59,934 HelpFormatter - --------------------------------------------------------------------------------
    INFO 08:11:59,936 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
    INFO 08:11:59,936 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 08:11:59,937 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 08:11:59,940 HelpFormatter - Program Args: -R /xxx/GenomeAnalysisTK-3.3-0/resources/exampleFASTA.fasta -T HaplotypeCaller -I /xxx/GenomeAnalysisTK-3.3-0/resources/exampleBAM.bam --sample_ploidy 48 --min_base_quality_score 20 --maxNumHaplotypesInPopulation 24 --min_mapping_quality_score 20 -dcov 10000 --max_alternate_alleles 2 -ERC BP_RESOLUTION --doNotRunPhysicalPhasing -nct 6 -pairHMM VECTOR_LOGLESS_CACHING -o /xxx/random.vcf
    INFO 08:11:59,961 HelpFormatter - Executing as xxx on Linux 2.6.32-504.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_71-mockbuild_2014_10_17_22_23-b00.
    INFO 08:11:59,961 HelpFormatter - Date/Time: 2014/12/03 08:11:59
    INFO 08:11:59,962 HelpFormatter - --------------------------------------------------------------------------------
    INFO 08:11:59,962 HelpFormatter - --------------------------------------------------------------------------------
    INFO 08:12:00,120 GenomeAnalysisEngine - Strictness is SILENT
    INFO 08:12:00,796 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 10000
    INFO 08:12:00,804 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 08:12:00,816 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
    INFO 08:12:00,823 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
    INFO 08:12:00,836 MicroScheduler - Running the GATK in parallel mode with 6 total threads, 6 CPU thread(s) for each of 1 data thread(s), of 12 processors available on this machine
    INFO 08:12:00,898 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO 08:12:00,911 GenomeAnalysisEngine - Done preparing for traversal
    INFO 08:12:00,912 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 08:12:00,912 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 08:12:00,912 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
    INFO 08:12:00,913 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
    INFO 08:12:00,913 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output
    INFO 08:12:01,034 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
    INFO 08:12:01,035 PairHMM - Performance profiling for PairHMM is disabled because HaplotypeCaller is being run with multiple threads (-nct>1) option
    Profiling is enabled only when running in single thread mode

    INFO 08:12:02,433 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalArgumentException: you have hit a current limitation of the GVCF output model that cannot handle ploidies larger than 20 , please let the GATK team about it: 48
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.getIndelPLs(ReferenceConfidenceModel.java:269)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calculateRefConfidence(ReferenceConfidenceModel.java:227)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.referenceModelForNoVariation(HaplotypeCaller.java:1209)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:983)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: you have hit a current limitation of the GVCF output model that cannot handle ploidies larger than 20 , please let the GATK team about it: 48
    ERROR ------------------------------------------------------------------------------------------
  • KNS (Member)

    And even a minimal dummy command line like this triggers the error:
    [xxx GenomeAnalysisTK-3.3-0]$ java -Xmx20g -jar GenomeAnalysisTK.jar \

    -R /xxx/GenomeAnalysisTK-3.3-0/resources/exampleFASTA.fasta \
    -T HaplotypeCaller \
    -I /xxx/GenomeAnalysisTK-3.3-0/resources/exampleBAM.bam \
    --sample_ploidy 21 \
    -ERC BP_RESOLUTION \
    -o /xxx/random1.vcf
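    The ploidy-21 failure above is consistent with a simple guard at a fixed ceiling of 20. Purely as an illustration of that behavior (this is a re-creation in Python, not the actual GATK Java source):

    ```python
    MAX_GVCF_PLOIDY = 20  # ceiling quoted by the v3.3-0 error message

    def check_gvcf_ploidy(ploidy: int) -> None:
        # Mimics the validation the ReferenceConfidenceModel stack trace
        # points at; illustrative only, not GATK source code.
        if ploidy > MAX_GVCF_PLOIDY:
            raise ValueError(
                "GVCF output model cannot handle ploidies larger than "
                f"{MAX_GVCF_PLOIDY}: got {ploidy}")

    check_gvcf_ploidy(20)  # passes silently
    # check_gvcf_ploidy(21) raises, matching the minimal command above
    ```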

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @KNS,

    Yes, in this case it is a hardcoded limitation, as I acknowledged in my previous comment. We know the exact line of code involved, and the developer has a pretty good idea of how to implement an unlimited version of the calculation. But once that's done, we would like to test the new code on data that actually contains a high ploidy, as opposed to random data we have lying around (none of which goes that high in ploidy), to make sure the new code does the right thing. There's a big difference between working (i.e. not crashing) and working correctly. If you'd rather not share data, or you don't have any candidate variant sites yet, that's fine; we can generate some artificial data. It just makes the test less realistic.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Update: this will be fixed in the next nightly build.
