Executor heartbeat timed out after X ms | StructuralVariationDiscoveryPipelineSpark

Sakhaa Member
edited September 15 in Ask the GATK team

Hello GATK team,

I'm trying to run 'StructuralVariationDiscoveryPipelineSpark' to find CNVs. It starts well, but after a while it fails with this error: 'ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 137575 ms'

(part of the log output from when the error starts to appear):

19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece610 !
19/09/13 06:49:18 INFO MemoryStore: MemoryStore cleared
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece611 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece614 !
19/09/13 06:49:18 INFO BlockManager: BlockManager stopped
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece615 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece612 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece613 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece607 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece608 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece605 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece606 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece609 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece600 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece603 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece604 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece601 !
19/09/13 06:49:18 WARN BlockManagerMasterEndpoint: No more replicas available for broadcast_17_piece602 !
19/09/13 06:49:18 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockManagerRemoved(1568346558188,BlockManagerId(driver, 10.109.201.103, 40444, None))
19/09/13 06:49:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(driver, 10.109.201.103, 40444, None)
19/09/13 06:49:18 INFO BlockManagerMasterEndpoint: Registering block manager 10.109.201.103:40444 with 15.8 GB RAM, BlockManagerId(driver, 10.109.201.103, 40444, None)
19/09/13 06:49:18 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockManagerAdded(1568346558190,BlockManagerId(driver, 10.109.201.103, 40444, None),16990076928,Some(16990076928),Some(0))
19/09/13 06:49:18 INFO BlockManagerMaster: BlockManagerMaster stopped
19/09/13 06:49:18 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/09/13 06:49:18 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.109.201.103, 40444, None)
19/09/13 06:49:18 INFO BlockManager: Reporting 0 blocks to the master.
19/09/13 06:49:18 INFO SparkContext: Successfully stopped SparkContext
06:49:18.208 INFO StructuralVariationDiscoveryPipelineSpark - Shutting down engine
[September 13, 2019 6:49:18 AM AST] org.broadinstitute.hellbender.tools.spark.sv.StructuralVariationDiscoveryPipelineSpark done. Elapsed time: 88.37 minutes.
Runtime.totalMemory()=31999918080
org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 8.0 failed 1 times, most recent failure: Lost task 8.0 in stage 8.0 (TID 37720, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 137575 ms

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at org.broadinstitute.hellbender.tools.spark.sv.evidence.FindBreakpointEvidenceSpark.removeUbiquitousKmers(FindBreakpointEvidenceSpark.java:660)
at org.broadinstitute.hellbender.tools.spark.sv.evidence.FindBreakpointEvidenceSpark.addAssemblyQNames(FindBreakpointEvidenceSpark.java:507)
at org.broadinstitute.hellbender.tools.spark.sv.evidence.FindBreakpointEvidenceSpark.gatherEvidenceAndWriteContigSamFile(FindBreakpointEvidenceSpark.java:176)
at org.broadinstitute.hellbender.tools.spark.sv.StructuralVariationDiscoveryPipelineSpark.runTool(StructuralVariationDiscoveryPipelineSpark.java:164)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:528)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
19/09/13 06:49:18 INFO ShutdownHookManager: Shutdown hook called
19/09/13 06:49:18 INFO ShutdownHookManager: Deleting directory /tmp/spark-57685327-f8c7-4813-88d6-c9ef0f8a721f

Using GATK jar /sw/csi/gatk/4.1.2.0/el7.5_binary/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /sw/csi/gatk/4.1.2.0/el7.5_binary/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar StructuralVariationDiscoveryPipelineSpark -I RMNISTHS_30xdownsample.sorted.bam -R /ibex/scratch/althubsw/ref/human_g1k_v37.2bit --aligner-index-image reference19.fa.img --kmers-to-ignore kmers_to_ignore19.txt --contig-sam-file aligned_contigs.sam -O structural_variants.vcf

I submitted it as a job through the Slurm job scheduler and specified 400 GB for it; my BAM file is 148 GB.
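For reference, the submission looked roughly like this (a sketch with placeholder values only; the job name, core count, walltime, and module name are not from my actual script):

```shell
#!/bin/bash
#SBATCH --job-name=sv_spark     # placeholder job name
#SBATCH --cpus-per-task=32      # placeholder core count
#SBATCH --mem=400G              # the 400 GB mentioned above (this requests RAM)
#SBATCH --time=48:00:00         # placeholder walltime

# Placeholder module name; the jar path matches the "Using GATK jar" line in the log.
module load gatk/4.1.2.0

gatk StructuralVariationDiscoveryPipelineSpark \
    -I RMNISTHS_30xdownsample.sorted.bam \
    -R /ibex/scratch/althubsw/ref/human_g1k_v37.2bit \
    --aligner-index-image reference19.fa.img \
    --kmers-to-ignore kmers_to_ignore19.txt \
    --contig-sam-file aligned_contigs.sam \
    -O structural_variants.vcf
```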

Any help to avoid this would be appreciated.

Thanks.

Answers

  • sarawasl Member
    I'm having the same error, any help/suggest to avoid that please?
  • shuangBroad Member, Broadie, Dev
    edited September 16

    Hi @Sakhaa and @sarawasl , thanks for testing out our Spark SV pipeline!

    I have some questions before I can make concrete suggestions.

    1. What motivated you to use an SV pipeline for CNVs? We do have a gCNV pipeline for germline CNV calling, if that fits your need (having said that, I am aware that people want to find large duplications with basepair-resolution breakpoint calls).
    2. What is the coverage of your BAM(s)? 148 GB seems high for a typical 30x WGS BAM (the pipeline is designed for WGS germline BAMs, so it may or may not work for a WES and/or cancer BAM).
    3. When you said "I specify 400 GB for it", were you referring to RAM or disk space? Based on the error message, it looks like a memory issue. Can you post the script you used to launch the job?
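On point 2, a rough back-of-envelope can show why 148 GB looks high for 30x. The bases-per-byte figure below is an assumed number for BGZF-compressed WGS BAMs (it varies a lot with read length and quality binning), so treat the result as order-of-magnitude only:

```python
def estimate_coverage(bam_bytes, genome_bp=3.1e9, bases_per_byte=1.3):
    """Very rough coverage estimate from compressed BAM size.

    bases_per_byte ~1.0-1.5 is an assumed figure for BGZF-compressed
    WGS BAMs (quality strings dominate the compressed size).
    """
    return bam_bytes * bases_per_byte / genome_bp

# The 148 GB BAM from this thread comes out well above 30x:
print(f"~{estimate_coverage(148e9):.0f}x")  # prints "~62x"
```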

    Thanks!
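As a generic Spark-side mitigation (not GATK-team advice): heartbeat timeouts can sometimes be postponed by raising the standard Spark timeout properties. GATK Spark tools accept Spark arguments after a `--` separator; the values below are illustrative guesses, and whether your GATK version passes `--conf` through this way should be checked against the docs:

```shell
# Illustrative only: raise Spark's heartbeat/network timeouts.
# spark.network.timeout must be at least as large as the heartbeat interval.
gatk StructuralVariationDiscoveryPipelineSpark \
    -I input.bam -R reference.2bit -O output.vcf \
    -- \
    --spark-master 'local[32]' \
    --conf spark.executor.heartbeatInterval=120s \
    --conf spark.network.timeout=600s
```

Note this only buys time; if the driver is genuinely out of memory during the kmer-collection stage, more heap or fewer parallel tasks is the real fix.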

  • sarawasl Member
    Thanks for the pointers @shuangBroad!

    Your questions helped me figure out that what I was using was wrong; I'm now switching to GermlineCNVCaller.
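For anyone making the same switch, the cohort-mode gCNV workflow is roughly the sequence below. This is a sketch from my reading of the docs; all file names are placeholders and the flags should be double-checked against the gCNV tutorial for your GATK version:

```shell
# 1. Bin the reference genome into intervals
gatk PreprocessIntervals -R reference.fasta \
    --bin-length 1000 --padding 0 \
    -O intervals.interval_list

# 2. Count reads per interval (repeat per sample)
gatk CollectReadCounts -I sample.bam -L intervals.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sample.counts.hdf5

# 3. Estimate per-contig ploidy across the cohort
gatk DetermineGermlineContigPloidy -I sample.counts.hdf5 \
    --contig-ploidy-priors contig_ploidy_priors.tsv \
    --output ploidy-dir --output-prefix ploidy

# 4. Call copy-number variants
gatk GermlineCNVCaller --run-mode COHORT \
    -L intervals.interval_list -I sample.counts.hdf5 \
    --contig-ploidy-calls ploidy-dir/ploidy-calls \
    --output cnv-dir --output-prefix cnv
```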