ConcurrentModificationException in GATK 3.7 GenotypeGVCFs

Hello,

I updated to GATK 3.7 and am seeing the following error when running a fairly simple GenotypeGVCFs command. If you'd prefer bug reports elsewhere (GitHub?), I'm happy to post them there instead.

The command is basically:

java -Xmx48g -Xms48g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R myGenome.fasta
--variant myVCF1.g.vcf.gz
--variant [total of 35 gVCFs]
-nda -nt 24 -stand_call_conf 30 --max_alternate_alleles 12 -A FractionInformativeReads

and the stack trace is:

13 Dec 2016 15:19:59,086 DEBUG: ##### ERROR --
13 Dec 2016 15:19:59,351 DEBUG: ##### ERROR stack trace
13 Dec 2016 15:19:59,371 DEBUG: java.util.ConcurrentModificationException
13 Dec 2016 15:19:59,392 DEBUG: at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966)
13 Dec 2016 15:19:59,931 DEBUG: at java.util.LinkedList$ListItr.next(LinkedList.java:888)
13 Dec 2016 15:19:59,952 DEBUG: at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.coveredByDeletion(GenotypingEngine.java:426)
13 Dec 2016 15:19:59,974 DEBUG: at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateOutputAlleleSubset(GenotypingEngine.java:387)
13 Dec 2016 15:20:00,052 DEBUG: at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:251)
13 Dec 2016 15:20:00,218 DEBUG: at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:392)
13 Dec 2016 15:20:00,251 DEBUG: at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:375)
13 Dec 2016 15:20:00,568 DEBUG: at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:330)
13 Dec 2016 15:20:00,638 DEBUG: at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:326)
13 Dec 2016 15:20:00,692 DEBUG: at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:304)
13 Dec 2016 15:20:00,735 DEBUG: at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:135)
13 Dec 2016 15:20:00,750 DEBUG: at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
13 Dec 2016 15:20:00,760 DEBUG: at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
13 Dec 2016 15:20:01,556 DEBUG: at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
13 Dec 2016 15:20:01,569 DEBUG: at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
13 Dec 2016 15:20:01,597 DEBUG: at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
13 Dec 2016 15:20:01,631 DEBUG: at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
13 Dec 2016 15:20:01,644 DEBUG: at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
13 Dec 2016 15:20:01,656 DEBUG: at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
13 Dec 2016 15:20:01,706 DEBUG: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
13 Dec 2016 15:20:01,725 DEBUG: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
13 Dec 2016 15:20:02,540 DEBUG: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
13 Dec 2016 15:20:02,557 DEBUG: at java.lang.Thread.run(Thread.java:745)
13 Dec 2016 15:20:02,584 DEBUG: ##### ERROR ------------------------------------------------------------------------------------------
13 Dec 2016 15:20:02,956 DEBUG: ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-g56f2c1a):
13 Dec 2016 15:20:02,968 DEBUG: ##### ERROR
13 Dec 2016 15:20:02,992 DEBUG: ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
13 Dec 2016 15:20:03,031 DEBUG: ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
13 Dec 2016 15:20:03,047 DEBUG: ##### ERROR Visit our website and forum for extensive documentation and answers to
13 Dec 2016 15:20:03,058 DEBUG: ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
13 Dec 2016 15:20:03,390 DEBUG: ##### ERROR
13 Dec 2016 15:20:03,402 DEBUG: ##### ERROR MESSAGE: Code exception (see stack trace for error itself)
13 Dec 2016 15:20:03,425 DEBUG: ##### ERROR ------------------------------------------------------------------------------------------

Comments

  • wang_yugui2 (Beijing, China), Member

    Another NullPointerException from GATK 3.7 GenotypeGVCFs.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin
    Hi all, it looks like some race conditions were introduced in 3.7. Unfortunately we can't devote effort to addressing that since we are moving away from using this type of parallelism altogether. I strongly recommend using scatter gather parallelism instead of nt/nct. If that's not an option for you right now, try reducing the number of threads you are using to reduce the chance of running into this issue.
  • bbimber (Home), Member

    Is that diagnosis just a guess based on the error, or have you gone as far as to actually identify potential culprit changes?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin
    It's based on the stack trace and on suddenly getting these reports with 3.7. If I had to guess I would look at 'f410451 - Change HashMap to LinkedHashMap for predictable ordering' as a starting point to investigate where the problem was introduced. There are a few other changes that may stand out as possible culprits -- but we just can't take the time to track it down precisely. The most principled way to do this, if you wanted to look into it yourself, would be a git bisect approach.
  • bbimber (Home), Member

    I think the more likely culprit is this change: "Removed spanning deletions if the deletion was removed"

    https://github.com/broadgsa/gatk-protected/commit/7392c4d1b068113f87139ad05ffe5cc4ddab7045

    If this seems plausible, I don't think fixing it would be that difficult, and I would be willing to try it; however, it would be great if the author (levine) were willing to take a quick look.

  • bbimber (Home), Member

    A fix could be as simple as wrapping the LinkedList in GenotypingEngine:

    private final List<GenomeLoc> upstreamDeletionsLoc = Collections.synchronizedList(new LinkedList<>());

    but again, it would be great to run this by someone a little more familiar with that code.
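
    A minimal sketch of how the wrapped list would need to be used, keeping in mind the Collections.synchronizedList Javadoc requirement that any iteration (as in coveredByDeletion()) must additionally hold the wrapper's monitor. The class and field names here are made up for illustration and are not the actual GATK code:

        import java.util.Collections;
        import java.util.Iterator;
        import java.util.LinkedList;
        import java.util.List;

        // Illustrative only; names do not come from the GATK source.
        public class SynchronizedListSketch {
            // add() and remove() calls become atomic, but iteration is not covered by the wrapper.
            private final List<String> upstreamDeletions =
                    Collections.synchronizedList(new LinkedList<>());

            public void record(final String loc) {
                upstreamDeletions.add(loc);  // already safe through the wrapper
            }

            public void prune(final String current) {
                // The wrapper's own monitor must be held for the whole traversal,
                // otherwise a concurrent add() from another thread can still trigger
                // a ConcurrentModificationException inside the iterator.
                synchronized (upstreamDeletions) {
                    for (Iterator<String> it = upstreamDeletions.iterator(); it.hasNext(); ) {
                        if (it.next().compareTo(current) < 0) {
                            it.remove();
                        }
                    }
                }
            }
        }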

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    That's a pretty good guess. Do you have a reproducible test case? That's usually the pain point when it comes to thread safety issues.

  • bbimber (Home), Member

    It's quite reproducible on our data, but I don't currently have a minimal test case. Let me see what I can do.

  • wang_yugui2 (Beijing, China), Member

    It's quite reproducible on my data too. Once a hotfix is ready, please post it and I will test it.

  • bbimber (Home), Member

    @wang_yugui2: Is your data based on a standard human genome? If so, would it be possible to make a minimal VCF (perhaps just a few variants that trigger this) and share it?

  • bbimber (Home), Member

    Also, while my understanding of exactly how the multiple threads behave is limited, I am fairly certain the changes introduced in 3.7 will never work correctly if more than one thread is processing data.

    The intent of the change is to keep a running list of deletion intervals, and when processing a given locus the engine scans this list for upstream deletions that span the current site. That seems reasonable. As it moves along the chromosome, it discards deletions that are completely upstream of the current position (GenotypingEngine.coveredByDeletion()). If only one thread is operating in a clean linear way across the chromosome, I can see why this makes sense and how it will work.

    Here's where my understanding of -nt could be wrong, but don't you have multiple threads processing different intervals? Therefore, if one thread starts discarding overlapping deletions, couldn't this screw things up for another thread working on a different interval? ConcurrentModificationException aside, couldn't this give the wrong result?
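
    To illustrate the failure mode in isolation, here is a toy sketch (not GATK code; the names and loop counts are made up) of one thread appending to a plain LinkedList while another thread iterates it, which is essentially the shared-state pattern described above:

        import java.util.LinkedList;
        import java.util.List;

        // Toy reproduction of the shared-list race: a background thread keeps
        // appending ("recording deletions") while the main thread iterates
        // ("scanning upstream deletions"). This typically fails quickly with
        // ConcurrentModificationException, though the exact timing varies run to run.
        public class SharedListRace {
            public static void main(String[] args) {
                final List<Integer> upstreamDeletions = new LinkedList<>();

                final Thread writer = new Thread(() -> {
                    for (int i = 0; i < 10_000_000; i++) {
                        upstreamDeletions.add(i);
                    }
                });
                writer.setDaemon(true);
                writer.start();

                long covered = 0;
                for (int scan = 0; scan < 1_000; scan++) {
                    for (final Integer loc : upstreamDeletions) {
                        if (loc >= 0) {
                            covered++;
                        }
                    }
                }
                // Even when no exception happens to be thrown, what the reader sees
                // depends on timing, which is the "wrong result" concern raised above.
                System.out.println("covered=" + covered);
            }
        }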

    (GitHub issue #1574, opened by Sheila, state: closed, closed by vdauwera)
  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi @bbimber, I think you're right about this -- it looks like we don't have a test for the multithreaded case, hence failed to detect this bug. We're looking into options now to get this patched. A reproducible test case would help expedite the process greatly.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @bbimber Upon further examination it looks like this might be difficult to fix, and as we're winding down our commitment of resources to GATK3 in favor of GATK4, I'm worried we may not be able to allocate resources to this. In GATK4 we no longer use internal nt/nct-style multithreading; we use scatter-gather strategies at the pipelining level, which we find more sane and have been using exclusively in production for some time, and in some tools we're baking in Spark functionality. So it's very possible we may be limited to patching this by simply disabling multithreading of GenotypeGVCFs.

    I'll update this thread when we get to a decision. Note that we'd also be willing to take a pull request if an external developer proposes a fix, in the interest of maintaining functionality for the community.

  • bbimber (Home), Member

    Yeah, I also didn't see an obvious fix. Just to make sure I understand the prior behavior: before 3.7, GATK simply didn't output spanning deletions, right? A deletion was recorded at the first position, and the first position only?

  • Sheila (Broad Institute), Member, Broadie, Moderator admin

    @bbimber
    Hi,

    We were actually outputting spanning deletions at all affected positions in previous versions, but in 3.7 they are being removed if they are not necessary. That is what is causing the issue.

    -Sheila

  • bbimber (Home), Member

    That's interesting, but if I am reading the code's history right, there were some pretty sizable changes between 3.6 and 3.7 in how this takes place. Can it simply be reverted to 3.6's handling of deletions?

    Alternatively, how important is it to remove unnecessary deletions? If we don't call it.remove(), doesn't that also probably fix it?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @bbimber The spanning deletion removal functionality was requested by some of our biggest stakeholders (who don't use multithreading), so reverting it outright would probably not fly. We could potentially make it optional, e.g. turned on by default but offer a disabling option, to serve the needs of users who need to use multithreading. That being said, I now worry that there may be other synch issues that we just don't know about and I would prefer if no one ever used multithreading with HaplotypeCaller and GenotypeGVCFs anymore. Also, I would like a pony and a ticket to a chocolate factory...

    Seriously though, what would you think of an option to disable the problem function? If that works, it might be the best we can offer...
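
    For concreteness, such a switch would presumably just be an ordinary walker argument. A rough sketch, assuming the standard GATK 3 @Argument annotation and using a purely hypothetical flag name (this is not an actual or planned GATK option):

        import org.broadinstitute.gatk.utils.commandline.Argument;

        // Hypothetical opt-out flag for illustration only; the name, default and doc
        // string are invented here and do not correspond to a real GATK argument.
        public class SpanningDeletionRemovalOption {
            @Argument(fullName = "dontTrimSpanningDeletions",
                      shortName = "dontTrimSpanningDeletions",
                      doc = "Keep * spanning-deletion alleles even when the original deletion record is removed",
                      required = false)
            public boolean dontTrimSpanningDeletions = false;
        }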

  • bbimber (Home), Member

    Thanks for the replies. I must be missing some of the background behind these changes. Is the idea that if something removes the initial deletion record (maybe because of filtering?), we now prune the * alternate allele from subsequent positions? Was the original bug that these spanning-deletion alleles persisted even if the original deletion was removed?

    Let me ask a couple things:

    1) In Queue, the genome is essentially split up into some number of intervals, and a separate job acts upon each, right? These jobs are basically independent. In this situation, what happens if a deletion spans two jobs? If the deletion starts toward the end of one job's interval and terminates at the beginning of a second job's interval, do things work correctly? Does the engine build in some type of padding/overlap between them?

    2) I'm not opposed to using Queue, but making a Queue script just to run a single tool always seemed incredibly silly. The Java code declares arguments, and you already have code in your build that makes Queue modules around most walkers (GATKExtensionsGenerator). I've ended up making silly little Queue scripts that just declare the arguments I want to use and run the appropriate module. Can I either a) run that Queue module from the command line directly, or b) do you have thoughts on whether a build-based solution like GATKExtensionsGenerator could auto-create vanilla Queue scripts to wrap walkers? This would make it a much clearer swap between running a GATK tool directly and running it via Queue, for every walker.

    (GitHub issue #1608, opened by Sheila, state: closed, closed by vdauwera)
  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi @bbimber,

    That's right, the added feature aims to remove the traces of spanning deletions when the original deletion record gets removed, which can happen during joint genotyping if the deletion fails to pass muster (remember that HC with -ERC GVCF sets the QUAL threshold to zero, but GenotypeGVCFs sets a non-zero threshold).

    Regarding your questions, I should preface my answers by saying that we have been steadily moving away from Queue and are now using an alternative system (an execution engine called Cromwell and a scripting language called WDL) for all our pipelining needs. But overall the answers to your questions are largely the same.

    1) It depends on the tool but generally, yes, Queue achieves parallelism by scattering over genomic intervals. Depending on the tool, Queue may be able to split up the genome completely arbitrarily (if individual positions can be handled entirely independently) or may require a list of processing intervals. In the latter case (which would be the case of GenotypeGVCFs) you'd give the exome targets list if running on exome data, or for WGS you'd either scatter by chromosome, or provide a list of intervals designed for this purpose (which we provide for some references). Assuming you ran HaplotypeCaller using that same list of intervals in the first place (which is the best way to run it), you would never have a deletion spanning two or more intervals because HC would have been limited to defining active regions that are confined within individual intervals.

    2) I agree that making single-tool workflows is a lot of effort, especially if you run a lot of custom/one-off jobs, and especially in Queue with all its Scala baggage. WDL is more lightweight -- but there's still some overhead to doing it that way, of course. It would absolutely be possible to generate template QScripts per tool; in fact we provide per-tool json files that are meant to make it easy to generate such wrappers with e.g. a simple python script. This was originally done at the request of the Galaxy developers, but we now use them to provide ready-to-use WDL task wrappers for all GATK tools (although we haven't made 3.7 wrappers yet, or the actual single-task workflow scripts -- coming soon!). All that being said, we usually work with multi-tool workflow scripts so we can pipeline as much as possible, and we rarely need to use single-tool workflows ourselves.

    Does that help?

  • bbimber (Home), Member

    Somewhat. Much like you, I'm looking for the minimal solution that will fix this problem. I think I'll just write a one-off Queue wrapper Scala script.

    Regarding how the walker partitions the genome: I assumed the annotations (@PartitionBy(), @Reference(), etc.) governed that, which seems nicer than making the user figure out custom intervals. Are you saying that with HaplotypeCaller + GenotypeGVCFs I absolutely need to be more explicit when submitting a Queue scatter/gather job, in order to avoid problematic situations?

    I can see why one needs to either split by chromosome or use some intelligently designed set of intervals, but we work with a lot of different genomes, with new ones added frequently, and I'd prefer to put as little onus as possible on the users actually submitting jobs. Is there any trick in the command line or Queue script to tell HaplotypeCaller + GenotypeGVCFs to partition per chromosome, without needing to supply that chromosome list up front? This would make it a little friendlier across variable genomes.

  • shlee (Cambridge), Member, Broadie, Moderator admin

    Hi @bbimber,

    I'm sure @Geraldine_VdAuwera has an answer more specific to your use case. In the meantime, I'd like to point you to a general solution that our production pipeline uses here. The context is a WDL script. The task, CreateSequenceGroupingTSV, uses a python script to dynamically create an intervals list that groups intervals by contig.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    The CreateSequenceGroupingTSV task @shlee links to should do the trick, or at least provide a good starting point for a custom implementation, @bbimber.
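
    As a rough illustration of the same idea (this is not the actual CreateSequenceGroupingTSV task, which is a python script embedded in the production WDL): read contig names and lengths from a samtools .fai index and greedily pack whole contigs into roughly size-balanced scatter groups, so that no contig is ever split across shards. The input file name below is just an assumption.

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.ArrayList;
        import java.util.List;

        // Sketch only: group whole contigs into scatter units of roughly equal size.
        public class GroupContigsForScatter {
            public static void main(String[] args) throws IOException {
                final String fai = args.length > 0 ? args[0] : "myGenome.fasta.fai";
                final List<String> lines = Files.readAllLines(Paths.get(fai));

                // Use the longest contig as the target size per group, as a simple heuristic.
                final long target = lines.stream()
                        .mapToLong(l -> Long.parseLong(l.split("\t")[1]))
                        .max().orElse(0L);

                final List<List<String>> groups = new ArrayList<>();
                List<String> current = new ArrayList<>();
                long size = 0;
                for (final String line : lines) {
                    final String[] fields = line.split("\t");   // .fai columns: name, length, ...
                    final long length = Long.parseLong(fields[1]);
                    if (!current.isEmpty() && size + length > target) {
                        groups.add(current);
                        current = new ArrayList<>();
                        size = 0;
                    }
                    current.add(fields[0]);
                    size += length;
                }
                if (!current.isEmpty()) {
                    groups.add(current);
                }

                // One line per scatter unit: a tab-separated list of contig names.
                for (final List<String> group : groups) {
                    System.out.println(String.join("\t", group));
                }
            }
        }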

  • bbimber (Home), Member

    Thanks to you both for the help. Alright, a few more questions though:

    1) HaplotypeCaller declares @PartitionBy(PartitionType.LOCUS). If I understand this correctly, if I'm using Queue or similar, the smallest unit a single scatter job will be given is one complete locus, right? Therefore if I have 20 chromosomes and ask Queue to run 1000 scatter jobs, I will have at most 20 jobs (which seems like a good thing).

    2) If that's true, given what we've been discussing about GenotypeGVCFs and spanning deletions, should GenotypeGVCFs also declare @PartitionBy(PartitionType.LOCUS)? If this is your best practice and we have a pretty clear case that breaks the results without moving in a linear manner across the full chromosome, isn't that the right thing to do?

    3) Assuming I am going to use Queue to kick off HaplotypeCaller and GenotypeGVCFs, do I really need to manually supply it with a list of intervals, or manually kick off N jobs, each with a custom interval I defined? If I understand @PartitionBy(), shouldn't this handle that for me? This is effectively the same question as #1. If I run Queue using -scatterCount=1000 (an excessively high number), and if the walker properly declares its PartitionType, I should expect to get the proper outcome (no splitting of chromosomes) without needing to do any extra work, correct? Likewise, if I run Queue with a low scatterCount (say -scatterCount=3 for a 20-chromosome genome), each job should still get intact chromosomes only, right?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    1) No, if I recall correctly, when partitioning by LOCUS (i.e. individual position) the work will be scattered into the number of jobs you requested. The program will attempt to produce N batches containing roughly the same number of loci to process. If you want to limit the number of jobs, you need to set the scatter count accordingly.

    2) GenotypeGVCFs does indeed partition by LOCUS as well, but what we're finding is that it would make more sense to make the scatter unit the interval, where you'd use the same intervals as for HaplotypeCaller. In that configuration you won't have any boundary issues since by definition you couldn't have events overlapping a boundary. The way this is handled in Queue is a little too naive; that's one of the reasons we're now using WDL instead because it gives us more direct control over how the scattering is effected.

    3) No, see explanations above. I really encourage you to switch to WDL instead. We're going to post some full WDL scripts for the joint calling portion of the Best Practices pipeline within the next week or so.

  • bbimber (Home), Member

    OK, thanks for all the help. I realize Cromwell/WDL is where GATK is going, but I had been hoping for a more surgical fix to the exception that began this whole thread. I'll see how much work it would be to swap Cromwell/WDL into our pipelines.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @bbimber I understand -- it's certainly not trivial to switch pipelining systems! On the bright side we have way more resources allocated to supporting WDL than we ever did for Queue.

    FYI we have a fix for the race condition that causes the error (should be in the nightly build tomorrow).

    Unfortunately we have decided that there's nothing that can realistically be done in GATK 3.x to resolve the underlying issue as it relates to spanning deletions that jump over a shard boundary. In GATK4 this doesn't apply because sharding is done differently, so we can't justify the resources to attempt a fix in 3.x. So we are considering the bug a limitation of GATK 3.

    That being said, with the race condition out of the way, your analysis should complete successfully. The only consequence of the bug is that there may be isolated cases of records where a sample has * in a genotype call in place of the REF allele. This should be extremely rare and would have occurred anyway in versions prior to 3.7; it's just that now we can remove almost all such cases except those that are accidentally too close to a shard boundary.

    Does that make sense?

  • bbimber (Home), Member

    Yes, it does. Thanks. I will probably still try to migrate since I'd rather not have the issue creep in in the future.

    We actually build our own JAR since we add some additional code. Should I see this fix in the 3.7 branch?

  • wang_yugui2 (Beijing, China), Member

    The 20170115 nightly build seems OK on a 48-thread machine.

  • ronlevine (Lexington, MA), Member

    @wang_yugui2 @bbimber The fix for the thread safety issue was in the 2017-01-13-gb1942bc nightly build.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @bbimber You won't see the fix in the public repository because that's just a snapshot of the release; any new code is embargoed until we release a patch (an unfortunate legacy setup that goes away in GATK4). That may happen sometime next week; I need to check the status of a couple of things first. Ping me if you don't hear anything by Wednesday.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @bbimber I should add: if you need this earlier than we can make a patch, here are the necessary code changes:

    In protected/src/main/java/org/broadinstitute/gatk/tools/walkers/genotyper/GenotypingEngine.java:

    Line 109

    -    private final List<GenomeLoc> upstreamDeletionsLoc = new LinkedList<>();
    +    private final ThreadLocal< List<GenomeLoc> > upstreamDeletionsLoc = ThreadLocal.withInitial(() -> new LinkedList<GenomeLoc>());
    

    Line 414

    -            upstreamDeletionsLoc.add(genomeLoc);
    +            upstreamDeletionsLoc.get().add(genomeLoc);
    
  • wang_yugui2 (Beijing, China), Member

    But the 20170115 nightly build seems to run out of memory. Is there a memory leak?
    [1191777.827257] Out of memory: Kill process 27911 (java) score 935 or sacrifice child
    [1191777.828040] Killed process 27911 (java) total-vm:266412772kB, anon-rss:253880776kB, file-rss:180kB, shmem-rss:0kB

    usage: -nt 20
    total memory: 256G
    OS: CentOS 7.3
    Java: 1.8

  • wang_yugui2 (Beijing, China), Member

    GATK 3.6 runs out of memory too on the same machine.

    The memory usage of the GenotypeGVCFs Java process keeps increasing during the run, and it runs out of memory at chr6.
    Either there is not enough memory available, or there is a memory leak.

  • bbimber (Home), Member

    I believe that patch also needs the change at line 425, but this is perfect to unblock us. Thanks again for the help. Also great to hear about the process/embargo changes in GATK4.

  • ronlevine (Lexington, MA), Member

    @bbimber You are correct.
    In protected/src/main/java/org/broadinstitute/gatk/tools/walkers/genotyper/GenotypingEngine.java:
    Line 425

    -     for (Iterator<GenomeLoc> it = upstreamDeletionsLoc.iterator(); it.hasNext(); ) {
    +     for (Iterator<GenomeLoc> it = upstreamDeletionsLoc.get().iterator(); it.hasNext(); ) {
    