GenotypeGVCFs will give a stack trace if using -nct > 1 and -allSites

MattBMattB NewcastleMember
edited December 2016 in Ask the GATK team

Hi, I'm getting a NullPointerException (see the trace below; it looks like some kind of NanoScheduler issue) when running GenotypeGVCFs with the -allSites argument on more than one core via -nct. The same operation completes successfully with -nct 1, or with -nct > 1 but without -allSites. The error is reproducible with both the original qual scores and -newQual.

INFO  00:09:07,418 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  00:09:07,423 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
INFO  00:09:07,424 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  00:09:07,424 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  00:09:07,424 HelpFormatter - [Tue Dec 20 00:09:07 GMT 2016] Executing on Linux 2.6.32-573.7.1.el6.x86_64 amd64 
INFO  00:09:07,425 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 
INFO  00:09:07,433 HelpFormatter - Program Args: --disable_auto_index_creation_and_locking_when_reading_rods -T GenotypeGVCFs -nt 5 -R /opt/databases/GATK_bundle/2.8/b37/human_g1k_v37_decoy.fasta --dbsnp /opt/databases/GATK_bundle/2.8/b37/dbsnp_138.b37.vcf -maxAltAlleles 50 -allSites --variant XTO_10_HC.g.vcf --variant XTO_11_HC.g.vcf --variant XTO_12_HC.g.vcf --variant XTO_13_HC.g.vcf --variant XTO_14_HC.g.vcf --variant XTO_15_HC.g.vcf --variant XTO_16_HC.g.vcf --variant XTO_17_HC.g.vcf --variant XTO_18_HC.g.vcf --variant XTO_19_HC.g.vcf --variant XTO_1_HC.g.vcf --variant XTO_20_HC.g.vcf --variant XTO_21_HC.g.vcf --variant XTO_22_HC.g.vcf --variant XTO_23_HC.g.vcf --variant XTO_24_HC.g.vcf --variant XTO_25_HC.g.vcf --variant XTO_26_HC.g.vcf --variant XTO_27_HC.g.vcf --variant XTO_28_HC.g.vcf --variant XTO_29_HC.g.vcf --variant XTO_2_HC.g.vcf --variant XTO_30_HC.g.vcf --variant XTO_31_HC.g.vcf --variant XTO_32_HC.g.vcf --variant XTO_33_HC.g.vcf --variant XTO_34_HC.g.vcf --variant XTO_35_HC.g.vcf --variant XTO_3_HC.g.vcf --variant XTO_4_HC.g.vcf --variant XTO_5_HC.g.vcf --variant XTO_6_HC.g.vcf --variant XTO_7_HC.g.vcf --variant XTO_8_HC.g.vcf --variant XTO_9_HC.g.vcf -o XTO.HC_genotyped.vcf --log_to_file /sharedlustre/users/nmb86/T_ALL_Original_XT_2ndGo/GenotypeGVCFs/XTO.GenotypeGVCFs.log 
INFO  00:09:07,437 HelpFormatter - Executing as nmb86@compute2-6.clusterlan on Linux 2.6.32-573.7.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14. 
INFO  00:09:07,437 HelpFormatter - Date/Time: 2016/12/20 00:09:07 
INFO  00:09:07,437 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  00:09:07,438 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  00:09:07,485 GenomeAnalysisEngine - Strictness is SILENT 
INFO  00:09:07,653 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  00:09:07,994 MicroScheduler - Running the GATK in parallel mode with 5 total threads, 1 CPU thread(s) for each of 5 data thread(s), of 20 processors available on this machine 
INFO  00:09:08,184 GenomeAnalysisEngine - Preparing for traversal 
INFO  00:09:08,191 GenomeAnalysisEngine - Done preparing for traversal 
INFO  00:09:08,192 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  00:09:08,192 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  00:09:08,193 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
WARN  00:09:08,417 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
WARN  00:09:08,418 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
INFO  00:09:08,418 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
WARN  00:09:13,431 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs 
INFO  00:09:38,211 ProgressMeter -     2:181008147   9361575.0    30.0 s       3.0 s       13.7%     3.6 m       3.1 m 
##### ERROR --
##### ERROR stack trace 
java.lang.NullPointerException
        at java.util.LinkedList$ListItr.next(LinkedList.java:893)
        at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.coveredByDeletion(GenotypingEngine.java:426)
        at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateOutputAlleleSubset(GenotypingEngine.java:387)
        at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:251)
        at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:392)
        at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:375)
        at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:330)
        at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:326)
        at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:304)
        at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:135)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Accepted Answer

    Hey @MattB, I think we simply had a little misunderstanding here -- I said "don't use -nct" in my response because you wrote -nct originally, and I didn't pay close attention, but the doc is correct that the only multithreading mode supported by GenotypeGVCFs is in fact -nt, which is what you were using.

    Anyway the underlying error you encountered was due to a simple race condition that has been fixed in the nightly build, so that should work for you now. There were a few other thread safety issues that got fixed along the way. We're going to cut a patch release to get the fixes out there soon.

    My word of caution on the current multithreading options being error prone still stands, though... In GATK4 we'll use Spark instead.
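The stack trace in the original post ends in `LinkedList$ListItr.next`, called from `GenotypingEngine.coveredByDeletion`, which is consistent with a list being read by one data thread while another thread modifies it. The following is a minimal sketch of one common remedy for that class of bug: giving each thread its own copy of the state. It is written in Python rather than GATK's Java, every name in it (`DeletionTracker`, `record`, `covered`, `worker`) is hypothetical, and it is not the actual fix that went into the nightly build.

```python
import threading
from collections import deque


class DeletionTracker:
    """Hypothetical stand-in for walker state that tracks upstream deletions.

    Each thread sees its own deque via threading.local, so no thread can
    modify a list that another thread is iterating over.
    """

    def __init__(self):
        self._local = threading.local()

    def _ends(self):
        # Lazily create this thread's private deque of deletion end positions.
        if not hasattr(self._local, "ends"):
            self._local.ends = deque()
        return self._local.ends

    def record(self, end):
        self._ends().append(end)

    def covered(self, pos):
        # Discard deletions that end before pos, then report whether any remain.
        ends = self._ends()
        while ends and ends[0] < pos:
            ends.popleft()
        return bool(ends)


def worker(tracker, start, results, idx):
    # Each thread handles its own block of positions (assumed, like a shard).
    tracker.record(start + 10)  # a deletion ending at start + 10
    results[idx] = (tracker.covered(start + 5), tracker.covered(start + 11))


tracker = DeletionTracker()
results = [None] * 5  # five data threads, as with -nt 5 in the log
threads = [
    threading.Thread(target=worker, args=(tracker, i * 1000, results, i))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # every thread reports (True, False)
```

Because the state is confined to each thread, the result does not depend on how the threads are scheduled. If the deque were a single attribute shared by all threads, one thread's `popleft` could run while another thread was inspecting the same deque, and the answers would depend on timing.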

Answers

  • MattBMattB NewcastleMember

    Good to know. I've now experienced it without -allSites as well, with seven genomes' worth of data using the normal joint genotyping workflow. It's worth stating that prior to 3.7 I'd not experienced this.

  • MattBMattB NewcastleMember

    Also worth stating for the record: I was using -nt, not -nct; see Program Args in the preformatted text (my bad in composing the original post). You might want to consider amending the documentation, as it currently states GenotypeGVCFs is "TreeReducible (-nt)" https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php

    Issue · Github (by Sheila): issue number 1639, state closed, closed by vdauwera
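Since the mix-up between -nt and -nct only came to light by rereading the log, a quick way to check which threading flags a run actually used is to read them from the "Program Args:" line that GATK 3 writes at startup. The sketch below assumes that line is on a single physical line in the log file and that only the short flag forms are used; the function name `threading_flags` and the shortened sample line are illustrative, not part of GATK.

```python
import re


def threading_flags(log_text):
    """Return the values of -nt and -nct found in a GATK 3 'Program Args:' line.

    A flag that is absent from the command line is reported as None.
    """
    match = re.search(r"Program Args: (.*)", log_text)
    args = match.group(1).split() if match else []
    found = {"-nt": None, "-nct": None}
    # Walk over (flag, value) pairs of adjacent tokens.
    for flag, value in zip(args, args[1:]):
        if flag in found:
            found[flag] = int(value)
    return found


# Shortened version of the Program Args line from the log in this thread.
sample = (
    "INFO  00:09:07,433 HelpFormatter - Program Args: "
    "--disable_auto_index_creation_and_locking_when_reading_rods "
    "-T GenotypeGVCFs -nt 5 -R human_g1k_v37_decoy.fasta -allSites "
    "--variant XTO_10_HC.g.vcf -o XTO.HC_genotyped.vcf"
)
flags = threading_flags(sample)
print(flags)  # {'-nt': 5, '-nct': None}
```

For the run in this thread the check reports five data threads (-nt 5) and no -nct, matching the MicroScheduler line "5 total threads, 1 CPU thread(s) for each of 5 data thread(s)".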
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @MattB
    Hi,

    Have a look at this thread. Thanks for the suggestions. I am about to make a ticket for this.

    -Sheila

  • MattBMattB NewcastleMember
    edited January 2017

    Hi @Geraldine_VdAuwera, thanks for getting back on this, and good to know there is a patch release in the works; I was somewhat concerned that something was no longer thread safe. Apologies again for my typo of -nct in place of -nt; I appreciate that -nct is not formally supported for GenotypeGVCFs. And thanks @Sheila, that thread is very revealing on this issue.

    I'm very much looking forward to GATK4, although I'm unsure how Spark will work in a non-cloud setting; I'm guessing this is going to be the main workload for me in getting it up and running. In our case we use Son of Grid Engine as our job scheduler on a cluster with a Lustre FS, as it's difficult to get permission to place our data on the cloud and/or to persuade people to fund computing as a consumable. I assume you can wrap Spark jobs as SGE jobs via WDL or some such; hopefully it will be easier than using Queue? I guess what I'm trying to get at is that I imagine there are plenty of users who will still be running GATK 3.x on a local cluster with LSF, PBS, or an SGE derivative and who would also like to use scatter-gather Spark parallelism in 4.x, hopefully in a way that is less painful than using Queue. I guess I would also like a pony and a ticket to a chocolate factory...
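For GATK 3.x on a local cluster, one way to get parallelism without -nt is to scatter at the process level: run one single-threaded GenotypeGVCFs job per contig using the -L interval argument and let the scheduler run them side by side. The sketch below only builds the command lines; it is an illustration, not a recommendation made by the GATK team in this thread. The function name `scatter_commands`, the output naming scheme, and the shortened file lists are assumptions; the flags themselves are the ones visible in the Program Args above plus -L.

```python
import shlex


def scatter_commands(jar, reference, gvcfs, contigs, out_prefix, all_sites=False):
    """Build one single-threaded GATK 3 GenotypeGVCFs command per contig."""
    commands = []
    for contig in contigs:
        cmd = ["java", "-jar", jar, "-T", "GenotypeGVCFs",
               "-R", reference, "-L", contig]
        if all_sites:
            cmd.append("-allSites")
        for gvcf in gvcfs:
            cmd += ["--variant", gvcf]
        # Hypothetical naming scheme: one output VCF per contig.
        cmd += ["-o", f"{out_prefix}.{contig}.vcf"]
        commands.append(" ".join(shlex.quote(part) for part in cmd))
    return commands


commands = scatter_commands(
    jar="GenomeAnalysisTK.jar",
    reference="human_g1k_v37_decoy.fasta",
    gvcfs=["XTO_1_HC.g.vcf", "XTO_2_HC.g.vcf"],
    contigs=["1", "2", "X"],
    out_prefix="XTO.HC_genotyped",
    all_sites=True,
)
for line in commands:
    print(line)
```

Each printed line can be submitted as its own scheduler job. The per-contig VCFs then need to be concatenated in contig order in a separate gather step, which is not shown here.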

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @MattB,

    I'm not yet fully up to speed on the requirements for the Spark tools, but my understanding is that they can also be used locally on multi-core machines (whether laptop or cluster) with minimal configuration. There's a discussion thread here that covers some of the main points of why and how we're moving to Spark.

    And this should be workable also in combination with Cromwell/WDL and a scheduler, sure. When we move to general release of GATK4 we'll put together a migration guide that will include recommendations for all the popular cluster setups we know people use when not on cloud. You might just get that pony... ;)

  • MattBMattB NewcastleMember

    Good to know! Look forward to reading it, it would be very helpful.
