GATK4 BaseRecalibratorSpark, Executor heartbeat timed out after X ms

Tintest · France · Member
edited March 2018 in Ask the GATK team

Hello, I'm trying to run GATK4 BaseRecalibratorSpark locally inside a Nextflow pipeline, but I get the same error every time with this command:

gatk-launch BaseRecalibratorSpark --spark-runner LOCAL --spark-master local[23] -R $indexbit -I ${input_bam} --known-sites ${params.dbsnpAll} -L ${params.targetBed} -O ${output_name}-recalibration_report.grp

Here is the full error output. Basically, the failure seems to be linked to an absence of activity from BaseRecalibratorSpark for about 126 seconds (per the error below), which kills the Spark heartbeat and then the process:

18/03/09 09:22:08 WARN Executor: Issue communicating with driver in heartbeater
java.lang.NullPointerException
        at org.apache.spark.storage.memory.MemoryStore.getSize(MemoryStore.scala:127)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:387)
        at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:218)
        at org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:217)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.storage.BlockManager.reportAllBlocks(BlockManager.scala:217)
        at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:236)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:522)
        at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
        at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
        at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
18/03/09 09:22:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/03/09 09:22:08 INFO SparkContext: Successfully stopped SparkContext
09:22:08.389 INFO  BaseRecalibratorSpark - Shutting down engine
[March 9, 2018 9:22:08 AM UTC] org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark done. Elapsed time: 61.53 minutes.
Runtime.totalMemory()=16815489024
org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 126542 ms
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1936)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1002)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.reduce(RDD.scala:984)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1127)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1104)
        at org.apache.spark.api.java.JavaRDDLike$class.treeAggregate(JavaRDDLike.scala:438)
        at org.apache.spark.api.java.AbstractJavaRDDLike.treeAggregate(JavaRDDLike.scala:45)
        at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.apply(BaseRecalibratorSparkFn.java:39)
        at org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark.runTool(BaseRecalibratorSpark.java:159)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:387)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:159)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:201)
        at org.broadinstitute.hellbender.Main.main(Main.java:287)
18/03/09 09:22:08 INFO ShutdownHookManager: Shutdown hook called
18/03/09 09:22:08 INFO ShutdownHookManager: Deleting directory /tmp/qtestard/spark-af28085b-4317-4d0f-85dd-a966b1d26d9d

I get the same error when I run it outside of Nextflow. I also tried running it with --conf spark.executor.heartbeatInterval=120, but it seems to have no effect, and I'm not sure that is the right syntax for a local Spark execution.
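
For reference, here is the kind of invocation I have in mind; this is only a sketch of the syntax, assuming that --conf can be repeated, that time values take an explicit unit suffix (e.g. 120s), and that spark.network.timeout needs to stay larger than spark.executor.heartbeatInterval. I have not verified that these values actually avoid the timeout.

gatk-launch BaseRecalibratorSpark --spark-runner LOCAL --spark-master local[23] -R $indexbit -I ${input_bam} --known-sites ${params.dbsnpAll} -L ${params.targetBed} -O ${output_name}-recalibration_report.grp --conf spark.executor.heartbeatInterval=120s --conf spark.network.timeout=600s

Is that the correct way to pass Spark properties for a local run, or do they need to go somewhere else?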

The complete log file is attached.

Can you help me?

Also posted on https://github.com/broadinstitute/gatk/issues/4515

Best regards.
