
MarkDuplicatesSpark fails to complete on Cluster

rieder (Innsbruck, Member)


I'm trying to use GATK 4.1.2 on a Spark cluster. My first test, with MarkDuplicatesSpark, was not successful. The job seems to run fine until the results are merged, but it then fails with the error "java.lang.IllegalArgumentException: Cannot merge zero BAI files" at the point where the BAI files should be merged.

Here are the last lines of the command line output:

19/07/04 14:01:33 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks have all completed, from pool
19/07/04 14:01:33 INFO DAGScheduler: ResultStage 18 (runJob at SparkHadoopWriter.scala:78) finished in 51.231 s
19/07/04 14:01:33 INFO DAGScheduler: Job 5 finished: runJob at SparkHadoopWriter.scala:78, took 124.964139 s
19/07/04 14:01:33 INFO SparkHadoopWriter: Job job_20190704135928_0045 committed.
19/07/04 14:01:33 WARN HadoopFileSystemWrapper: Concat not supported, merging serially
19/07/04 14:01:34 INFO IndexFileMerger: Merging .sbi files in temp directory marked_duplicates.bam.parts/ to /data/scratch/rieder/spark/marked_duplicates.bam.sbi
19/07/04 14:01:34 INFO IndexFileMerger: Done merging .sbi files
19/07/04 14:01:34 INFO IndexFileMerger: Merging .bai files in temp directory marked_duplicates.bam.parts/ to /data/scratch/rieder/spark/marked_duplicates.bam.bai
19/07/04 14:01:34 INFO SparkUI: Stopped Spark web UI at http://zeus.icbi.local:4040
19/07/04 14:01:34 INFO StandaloneSchedulerBackend: Shutting down all executors
19/07/04 14:01:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
19/07/04 14:01:34 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/04 14:01:34 INFO MemoryStore: MemoryStore cleared
19/07/04 14:01:34 INFO BlockManager: BlockManager stopped
19/07/04 14:01:34 INFO BlockManagerMaster: BlockManagerMaster stopped
19/07/04 14:01:34 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/04 14:01:34 INFO SparkContext: Successfully stopped SparkContext
14:01:34.687 INFO MarkDuplicatesSpark - Shutting down engine
[July 4, 2019 2:01:34 PM CEST] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 7.13 minutes.
java.lang.IllegalArgumentException: Cannot merge zero BAI files
at htsjdk.samtools.BAMIndexMerger.finish(BAMIndexMerger.java:51)
at org.disq_bio.disq.impl.file.IndexFileMerger.mergeParts(IndexFileMerger.java:92)
at org.disq_bio.disq.impl.formats.bam.BamSink.save(BamSink.java:132)
at org.disq_bio.disq.HtsjdkReadsRddStorage.write(HtsjdkReadsRddStorage.java:225)
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:155)
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:120)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:361)
at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.runTool(MarkDuplicatesSpark.java:325)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:528)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/07/04 14:01:34 INFO ShutdownHookManager: Shutdown hook called
19/07/04 14:01:34 INFO ShutdownHookManager: Deleting directory /local/spark/rieder/spark-687029a6-9ce0-423b-a2dd-aef4aafacd16
19/07/04 14:01:34 INFO ShutdownHookManager: Deleting directory /tmp/spark-89c32d63-7b82-4e3d-bfd3-551b7ce37941

I ran the following script:

export PATH=$PATH:/opt/spark/bin
/usr/local/bioinf/gatk/latest4/gatk MarkDuplicatesSpark \
-I TEST_N.sorted.bam \
-O marked_duplicates.bam \
-M marked_dup_metrics.txt \
-- \
--spark-runner SPARK \
--spark-master spark:// \
--num-executors 1 \
--executor-cores 8

Can anyone help me to get this running?

Best Answer


  • bshifaw (Member, Broadie, Moderator, admin)

    Hi @rieder,

    In the directory where TEST_N.sorted.bam is located, is there a corresponding index file?
    You may need to create a new index file using BuildBamIndex.

  • rieder (Innsbruck, Member)

    Hi @bshifaw,

    Thanks for your answer, and sorry for my late reply.
    Unfortunately, creating a new index file with Picard as you suggested did not solve the problem. I get the same error: "Cannot merge zero BAI files".

    In the spark worker directories I can see BAI files though, e.g.:

    I see:

    Any idea?

  • bshifaw (Member, Broadie, Moderator, admin)

    Could you check whether the job runs with regular MarkDuplicates (no Spark)? This will help determine whether it's a Spark-related error.
    Also validate your BAM file with ValidateSamFile to make sure there's nothing wrong with the file you are using.
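The two input checks suggested so far in this thread (rebuilding the index with BuildBamIndex and validating with ValidateSamFile) could be scripted roughly as below. This is only a sketch: the `gatk` path and BAM name are taken from the original post, `run` is a hypothetical dry-run helper added for illustration, and SUMMARY mode is an assumption about what output is most useful here.

```shell
#!/bin/sh
# Sketch of the input sanity checks discussed in this thread.
GATK=/usr/local/bioinf/gatk/latest4/gatk   # path taken from the original post
BAM=TEST_N.sorted.bam                      # input BAM from the original post

# run: hypothetical helper. Prints the command; executes it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Rebuild the index for the (coordinate-sorted) input BAM.
run "$GATK" BuildBamIndex -I "$BAM"

# Validate the BAM; SUMMARY mode reports counts of errors and warnings.
run "$GATK" ValidateSamFile -I "$BAM" --MODE SUMMARY
```

By default the script only prints the commands; exporting `DRY_RUN=0` before running it executes them.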

  • rieder (Innsbruck, Member)


    I just tried to run the regular version:

    /usr/local/bioinf/gatk/latest4/gatk MarkDuplicates \
    -I TEST_N.sorted.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt

    and the Spark version without a cluster (local only):

    /usr/local/bioinf/gatk/latest4/gatk MarkDuplicatesSpark \
    -I TEST_N.sorted.bam \
    -O marked_duplicates.bam \
    -M marked_dup_metrics.txt \
    --conf 'spark.executors.cores=4'

    Both finished successfully.
    It seems strange to me that the Spark cluster run does not succeed.


  • bshifaw (Member, Broadie, Moderator, admin)

    It sounds like the tools and input are fine, but there might be something wrong with the Spark setup. I'll have to refer this one to the dev team. In the meantime, you can try to confirm that your cluster is set up correctly and that the Spark arguments are correct. Also take note of this recommendation from the tool documentation:
    "This Spark tool requires a significant amount of disk operations. Run with both the input data and outputs on high throughput SSDs when possible."

  • rieder (Innsbruck, Member)

    I tried running everything on local SSDs; unfortunately, this didn't help.

    Here are my settings for the Spark cluster:

    I set the output location to a local SSD as well: "-O /local/marked_duplicates.bam".
    The input BAM needs to be located on a filesystem shared across the cluster nodes (CephFS in our case); otherwise the Spark workers report that the file cannot be found.

  • rieder (Innsbruck, Member)


    Omitting the index creation helped. Thanks!
    Looking forward to the new, improved version.
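The thread does not show the exact command that worked. A minimal sketch, assuming index creation was switched off with GATK's standard `--create-output-bam-index false` argument and the output was indexed in a separate step afterwards, might look like this. The `run` helper is hypothetical, and the Spark master URL is truncated in the original post, so it is left as-is and must be filled in.

```shell
#!/bin/sh
# Sketch: rerun MarkDuplicatesSpark on the cluster without writing a BAM index.
GATK=/usr/local/bioinf/gatk/latest4/gatk   # path taken from the original post
SPARK_MASTER="spark://"                    # truncated in the original post; set your own master URL

# run: hypothetical helper. Prints the command; executes it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Tool arguments go before the bare "--", Spark arguments after it.
# Assumption: --create-output-bam-index false is how index creation was omitted.
run "$GATK" MarkDuplicatesSpark \
  -I TEST_N.sorted.bam \
  -O marked_duplicates.bam \
  -M marked_dup_metrics.txt \
  --create-output-bam-index false \
  -- \
  --spark-runner SPARK \
  --spark-master "$SPARK_MASTER" \
  --num-executors 1 \
  --executor-cores 8

# Optional follow-up (an assumption, not from the thread): index the output separately.
run "$GATK" BuildBamIndex -I marked_duplicates.bam
```

With the default dry-run setting the script only prints both commands, which makes it easy to review the arguments before submitting to the cluster.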
