We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

MarkDuplicateSpark is slower than normal MarkDuplicates

I was happy to hear that MarkDuplicatesSpark is out of Beta now in Version 4.1. So I tested it on one of our WES samples. Unfortunatelly it took 2 times as long as the normal MarkDuplicate. Here are the two comands I used:

gatk MarkDuplicatesSpark -I '/media/Berechnungen/0028-19.recal.bam' -O '/media/Berechnungen/0028-19.dedup.bam' --metrics-file '/media/Berechnungen/0028-19.metrics' --spark-master local[4] --verbosity ERROR --tmp-dir /media/Ergebnisse/picardtmp/

gatk MarkDuplicates -I '/media/Berechnungen/0028-19.recal.bam' -O '/media/Berechnungen/0028-19.dedup.bam' -M '/media/Berechnungen/0028-19.metrics' --TMP_DIR /media/Ergebnisse/picardtmp/

Did I do something wrong in parallelization?

Bests Stefan


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @StefanDiederichMainz,

    Here are some questions from GATK developers that would help us determine what is going on.

    1. Can you try local[*] instead of local[4] and see what the wall-clock time is.
    2. What is your disk setup, e.g. are you using an NFS mount? Related to this, how many cores does your system have?
    3. Can you try sending the metrics file to-M /dev/null?
    4. Anything peculiar about your BAM? If the above does not shed light, then the developers ask if you would be able to share the entirety of your BAM file towards figuring out what may be going on. Directions for sharing data are at https://software.broadinstitute.org/gatk/guide/article?id=1894.

    Thanks for testing out MarkDuplicatesSpark.

  • Hi @shlee,

    thanks for your quick answer.
    1. I did some wall-clock time measurements
    MarkDuplicates: 15 min 55 sec
    MarkDuplicatesSpark local[4]: 28 min 26 sec
    MarkDuplicatesSpark local[8]: 17 min 49 sec
    MarkDuplicatesSpark local[*]: 11 min 00 sec
    2. We are using an SSD for the calculations which is directly in the server setup. So there is no NFS mount.
    3. Redirecting to-M /dev/null does not change the wall-clock time (local[8] with -M /dev/null needs 17 min 24 sec.
    4. I tried to upload the file but had some problems. If you need the file I will give it some more tries or will upload it to our file server and send you a link for downloading...

    By using all available cores (36) it seems to be faster than the normal MarkDuplicates. But I can not always use alle the available cores because several users are using this server.

    Do you see a similar behaviour in your environment?


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for the follow-up @StefanDiederichMainz. Scattering across 36 cores seems a bit much to be able to outperform the non-Spark tool. However, the developers mention that the metrics file is likely the bottleneck. So, we have one more thing for you to control for and this is the writing of the metrics file. It looks like although MarkDuplicates requires the metrics file, MarkDuplicatesSpark does not. A separate standalone tool, EstimateLibraryComplexity, allows for collecting the exact same metrics. The developers think that if you omit writing the metrics file, this will allow for MarkDuplicatesSpark to perform efficiently, as expected. So to be comparable in results, you can run MarkDuplicatesSpark without the metrics plus EstimateLibraryComplexity concurrently. Are the metrics important to your pipeline?

  • I do not need the metrics file in my pipeline but did not know that the spark tool does not require this option. I tested it without the -m option:

    • local[4]: 21 min 29 sec
    • local[8]: 13 min 38 sec
    • local[*]: 09 min 08 sec

    The spark tool is faster without writing a metrics file but I think the developers hopped to see a bigger effect...

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks, @StefanDiederichMainz, for testing the sans metrics case. I think the last thing for us to test is your BAM file. Was there a particular error when you tried to upload to our FTP site? Any which way you can share private data safely, can you please share with us? You can direct message me within the forum if you need.

  • mack812mack812 SpainMember

    Should not be MarkDuplicatesSpark be compared to MarkDuplicates + SortSam, as shown in the following link?


    The output of MDSpark is coordinate sorted.

  • StefanDiederichMainzStefanDiederichMainz MainzMember
    edited February 2019

    I uploaded the file to a fileshare folder of the University Mainz and send you a PM with the link and login data.
    One other small thing I found out is, that MDSpark does not take comma in filenames although I quoted the filepath. Is there any way to allow comma in filepath or do I have to avoid using them?

    I tested MDSpark with an unsorted SAM file as input and you are right, it works and the output is a sorted BAM file. So I can skip the SortSam tool wich will save me about 15-20 min in processing time. So MDSpark in total is faster than SortSam+MD +Indexing.
    Thanks for that hint!

    Issue · Github
    by shlee

    Issue Number
    Last Updated
    Closed By
  • mack812mack812 SpainMember

    Great! Glad to hear that the difference is substantial in your settings @StefanDiederichMainz

    In my case MDSpark took nearly 1 hour and a half for a large WES bam (around 140 million reads, 17GB in size), running on GCP cloud with 16 CPUs and 64 GiB RAM (it failed with 16 GiB or 32GiB RAM). I was expecting a shorter time but I might be wrong.

    @shlee, is there any way to tweak the runtime parameters and/or java options to make MDSpark run optimally?


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @StefanDiederichMainz and @mack812,

    Stefan--we've received your information and a developer will be looking into your case. We will either have some thoughts for you today or sometime after a week because they are unavailable for the next week.

    @mack812, it's been a long while since I thought of Spark. You might find the process I outline in https://gatkforums.broadinstitute.org/gatk/discussion/10060/how-to-run-flagstatspark-on-a-cloud-spark-cluster of interest towards monitoring how a run's workers consume resources (CPU, I/O etc) and how to ensure work is distributed equally among workers instead of just to a single worker, as appears to happen in the Spark run illustrated in the tutorial.

  • LouisBLouisB Broad InstituteMember, Broadie, Dev ✭✭✭

    @mack812 That's definitely not what we'd expect. I would expect it to be both much faster and require less memory. Can you post your settings? Are you using a local SD on your cloud machine? We found very poor performance when running MDSpark using persistent disk either HDD or SSD when on google cloud.

  • mack812mack812 SpainMember

    Hi @LouisB,

    Thanks for replying.

    The length of the run that I mentioned on my previous post (1 hour and a half) is from the starting of the VM machine to the time when I got the .bam output in the bucket. A good portion of that is IO I imagine. According to the tool's log, which I am attaching, the computing elapsed time was 73.73 minutes (sorry for the confusion). If I understand correctly this log, the dup-marking task took 38 minutes while the rest of the time (35 mins) was for merging files, generating the bam indexes (bsi and bai) and I guess that also sorting the file by genomic coordinates.

    I am also attaching the .sh script generated by the GCP backend, in which you can see that the input was composed of 10 aligned bams, which were generated by a scattered run of bwa mem (I split the 2 starting fastqs in 10 pairs, which made the alignment really fast). I do not know if taking multiple bams as inputs instead of just one bam file could be messing with the expected run time, but I guess that this is the most common situation.

    One additional thing in case this could be also interfering with the expected running time: as you can see in the .sh script I did not use the optical distance and regex options. The fastqs that I am working with for this trials are from public repositories (i.e. SRA). SRA-deposited fastqs' read name is transformed into a SRR(number).1/SRR(number).2 naming, erasing the flowcell-lane-cluster position from the read ID and therefore making the marking of optical duplicates useless. Anyway, according to the metrics report that I generated with the "EstimateLibraryComplexity" tool for the dup-marked bam, the fraction of PCR dups is high on this bam (common in FFPE samples...): 34.5%. Maybe that also made the tool run slower than expected.

    Regarding the runtime parameters, this is the task configuration (I am hard scripting below the runtime parameters used):

          call MarkDuplicatesSpark {
              input_bams = input_bams,
              output_bam_basename = base_file_name + ".aligned.sorted.duplicates_marked",
              total_input_size = total_input_size,
              compression_level = compression_level,
              gatk_docker = gatk_docker,
              disk_pad = disk_pad,
              preemptible_tries = agg_preemptible_tries
          task MarkDuplicatesSpark {
            Array[File] input_bams
            String output_bam_basename
            Float total_input_size
            Int compression_level
            Int preemptible_tries
            String gatk_docker
            Int disk_pad
            Float md_disk_multiplier = 3.25
            Int disk_size = ceil(md_disk_multiplier * total_input_size) + disk_pad
            String? read_name_regex
            Int? optical_distance
            Int? mem
            Int machine_mem = if defined(mem) then mem * 1000 else 64000
            Int command_mem = machine_mem - 2000
            Int? cpu
            command {
            gatk --java-options "-Dsamjdk.compression_level=2 -Xms62000m" \
              MarkDuplicatesSpark \
                -I ${sep=' -I ' input_bams} \
                -O ${output_bam_basename}.bam \
                ${"--read-name-regex " + read_name_regex} \
                ${"--optical-duplicate-pixel-distance " + optical_distance} \
                -VS SILENT \
                -- --spark-master 'local[*]'
          runtime {
            docker: "us.gcr.io/broad-gatk/gatk:"
            cpu: 16
            preemptible: 10
            memory: "64 GB"
            disks: "local-disk 62 HDD"
          output {
            File output_bam = "${output_bam_basename}.bam"
            File output_bai = "${output_bam_basename}.bam.bai"

    Also, there was an error in my wdl script: asking the task to output a ".bai" index instead of the ".bam.bai" that is actually created by the tool (already corrected above), which explains the final error in the tool's log.

    Previous to this run I got two failed runs when setting the runtime memory to 16 GB or 32 GB. The tool stopped suddenly and threw out a weird error very early during the run, about not being able to write on my bucket because it had the "Requester Pays" feature ON (it was OFF both times). I did not keep these previous logs.

    Sorry if I am missing something very obvious here. Thanks for your help.

  • emeryjemeryj Member, Broadie
    edited February 2019

    @StefanDiederichMainz I have taken a look at your bam I would make a note that your input bam is currently coordinate sorted. MarkDuplicatesSpark is optimized for inputs that are either queryname sorted or querygrouped as it needs to group read pairs together. To get around this problem MarkDuplicatesSpark first sorts any input that isn't grouped by readnam, and then proceeds to mark duplicates as normal. I suspect this might explain your observation that MarkDuplicatesSpark is slower than advertised. MarkDuplicates is run immediately after mapping in our pipeline and thus the reads at that stage are typically grouped by readname. Consequently we haven't looked into optimizing the queryname grouped use case. It is worth noting that this problem also affects Picard MarkDuplicates, which won't mark secondary or supplementary reads at all in the case of coordinate sorted input. Since Picard doesn't explicitly group reads into their pairs until a later stage in the process it doesn't have the same input sorting performance hit that MarkDuplicatesSpark has.

    @mack812 I would love to see the logs/failures from your unsuccessful runs of MarkDuplicatesSpark with less memory. Our testing found that most 30x WGS bams worked comfortably on 20GB of RAM or less, though the memory usage scales with library complexity and 34.5% is high so its possible that is the source of your issues. I would like to diagnose your memory issues. I wouldn't expect the change in the --read-name-regex field to significantly impact the runtime compared to Picard MarkDuplicates.

    The primary issue I see with your wdl is that you are using disks: "local-disk 62 HDD", we found that MarkDuplicatesSpark requires low latency disk to run efficiently (indeed most of our spark tools do). I would recommend changing that line to disks: "local-disk 62 LOCAL" which requests an SSD that is physically on the same rack as your cpu and thus has significantly higher throughput compared to HDD and SSD which are both cloud persistent disks that require network IO to use behind the scenes. You can read more about the disk pricing/properties here: https://cloud.google.com/compute/docs/disks/. If you are on a new version of Cromwell then requesting over 375GB of LocalSSD will automatically request enough local disks to meet your needs. You should also consider adding --conf 'spark.local.dir=./tmp' to your gatk command. This ensures that spark is dumping its temp files onto a fast disk (the main local-disk that you requested) and not the boot disk (which is guaranteed to be both slow and small).

  • LouisBLouisB Broad InstituteMember, Broadie, Dev ✭✭✭
    edited February 2019

    A further note on disk speed. Google cloud disk speed is complex and unintuitive. Speed for persistent disk is proportional to the disk size. I believe the recommendation is that 1TB of HDD is about the same speed as a physical HDD. A 62 GB disk will be limited to something like 6% of a regular disk throughput, which is super slow. It's not totally linear and depends on the machine size as well, but I think the recommendation is to not use something smaller than 1TB of HDD for any disk intensive operation. See this page for more information

    In any case, we definitely recommend using LOCAL ssd's. They're more expensive but the speedup is usually worth it. You also get a discount on LOCAL disks when running pre-emptable machines while persistent disk remains the same cost.

  • @shlee and @emeryj
    thanks for all your usefull help and discussion. I am using MDSpark directly after mapping with BWA mem now and can see a clear time reduction and advantage compared to the non-Spark MD.
    Thanks again to all for helping out with this

  • mack812mack812 SpainMember

    Thanks so much @emeryj and @LouisB for this useful information. I will take this into account. I was trying to reduce expenses by sizing dynamically the disk space, as done by many of the wdl's in the best practices workflows (like the five dollars genome one), but I will switch to local disks for the Spark tools following your advice (thanks!).

    I just re-tried the run with exactly the same settings but now with the local disk configuration on the runtime section of the task and the extra --conf 'spark.local.dir=./tmp' argument. A "Local SSD scratch disk" 375GB in size was available "inside" the VM machine console window. Overall there was a substantial improvement: now only 47.63 minutes of elapsed time (nearly 1 hour from VM creation to output). Reading the log it seems that the "first part" took only 11 minutes this time (wow!), while the "second part" (merging files etc.) was not improved and remained at 37 minutes. I am attaching the tool's log.

    I also did another run with less RAM (52 GB) but the MDSpark tool failed again. This time it just got stuck during the "first part" (also attaching the tool's log from this run). The log did not move pass that last line in the log. After 10 minutes of the run being stuck there I decided to abort the run. There was no error prompted when I aborted, it just got "frozen" at that point.

    I have noticed that one of the first lines of the 52GB run's log says "INFO MemoryStore: MemoryStore started with capacity 28.5 GB", while when using 64 GB RAM the same line is "INFO MemoryStore: MemoryStore started with capacity 34.6 GB". It seems like the spark engine is only using around 54% of the RAM available in the VM machine (even if I am setting the JVM -Xms heap size at machine mem - 2GB), and that this specific bam needs somewhere between 28.5 and 34.6 GB RAM to succeed at the MDSpark run. Can the memory available to spark be increased by adding another "--conf" line?


  • @emryj: I am running MArkDuplicatesSpark immediately following MergeBamAlignment (Sort-Order=queryname) You state above " inputs that are either queryname sorted or querygrouped as it needs to group read pairs together. "

    Will MarkDuplicatesSpark be faster if the bam is querygrouped ? How do i get MergeBamAlignment to use sort order "querygroup". There is no such option for MergerBamAlignment.

    You continue to mention: Consequently we haven't looked into optimizing the queryname grouped use case. What does "queryname grouped" mean??

  • emeryjemeryj Member, Broadie

    Hello, @jaideepjoshi. You caught an error in my language, I meant to say "Consequently we haven't looked into optimizing the coordinate sorted use case". I apologize for the confusion.

    Query grouped is not a sort order strictly, usually the sam file will say the SO (sort order) is unsorted with the GO (group order) tag set to queryname. What it means is that all reads with the same name are together in the file even if those names themselves are unsorted amongst themselves. This queryname sort is a subset of query grouped ordering. Either way MergeBamAlignment (Sort-Order=queryname) should be sufficient to run MarkDuplicates quickly.

    @mack812 The breakdown of runtime for MarkDuplicates itself vs. file-writing/merging is basically what we expect, unfortunately merging n-partitions into the same file on unix requires a slow single core process. The solution to this is probably to try using NIO output there, but that is currently a work in progress in spark and may have other issues. There are improvements to NIO coming in the next version or so of gatk that might speed this up substantially.

    The errors you are seeing with MarkDuplicates at sub 64 GB look like they may be some other issue than memory for gatk. Typically when spark tools run low on memory you can see in the log that spark starts sputtering endlessly spilling tiny chunks of its RDDs to disk until it possibly unceremoniously dies with some memory allocation related stack-trace. Thats not what I see in your logs, they just die in the middle of what looks like a typical run. I suspect something outside of the jvm is dying. To this I would recommend trying again with less ram but using the -Xmx command rather than -Xms, as this sets the upper bound of memory java will grab rather than the lower bound, which might save you if the issue is that the off-heap memory usage is dying because it only has access to 2 GB of space total. Next I would try lowering the value itself to be machine mem - 5G just to rule this out. I hope this helps your problem.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @jaideepjoshi,

    How do i get MergeBamAlignment to use sort order "querygroup". There is no such option for MergerBamAlignment.

    To leave the aligner's sort order intact (query-group for BWA), you can use MergeBamAlignment's SORT_ORDER=unsorted. This implementation is used in our production pipeline described here. This should be the most efficient path.

    Specifying a queryname sort-order with MergeBamAlignment's --SORT_ORDER queryname then must ask the tool to consciously sort the reads, costing compute. There is another sort-order, --SORT_ORDER unknown and the tool's default output sort-order is coordinate.

    You can read more about MergeBamAlignment in general at https://gatkforums.broadinstitute.org/gatk/discussion/6483/how-to-map-and-clean-up-short-read-sequence-data-efficiently/p1. There is considerable follow-up discussion as well that may help clarify any questions that arise.

  • mack812mack812 SpainMember

    Hi @emeryj,

    I applied both changes at once (using -Xmx instead of -Xms AND using machine mem = java mem + 5) and I am happy to report that it worked nicely with a VM machine of "only" 28GB RAM (and 16 vCPUs). I am attaching the tool's log for your records.

    Before this change, I was dynamically sizing RAM as java mem = 3.65 * total input bam size and machine mem = java mem + 2. This time java mem was only 1.5 * total input bam size. I guess I could try even lower values of the multiplier. Do you recommend not going under any specific RAM value?

    If you let me ask one last question: would MarkDuplicatesSpark be able to work in NIO mode (i.e. declaring the input files as Array[String] instead of Array[File]? I have tried NIO with other tools but always with coordinate sorted and indexed bams as inputs, which is not the case of my inputs at this stage of the workflow.

    Thanks so much for your help.

  • Angry_PandaAngry_Panda Member

    Hi, good question. I am trying to compare MarkDuplicates with MarkDuplicatesSprak as well. I am doing with now,but I dont mind to change to
    One problem is , I used github production code offered by Broad Institute.
    When I use MarkDuplicates, I used argument "-ASO queryname"
    But I cannot find this in MarkDupicatesSpark. without this argument, terminal complained "This program requires input that are either coordinate or query sorted. Found unsorted"
    Could I get some suggestion?
    Best, wishes.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Angry_Panda

    MarkDupicatesSpark required you to provide either coordinate-sorted or query-sorted inputs, however it is recommended that the input be query-sorted or query-grouped. See Picard's SortSam command below. You can sort a sam or bam file using queryname or readname.


  • Angry_PandaAngry_Panda Member

    Hi @bhanuGandham thanks for your answer.
    I also read and tried what you posted but funny report generated.

    What I did:
    I used FastqToSam to transfered my raw FASTQ file to uBAM file, and as tool default, it will be query (name) sorted.
    I used MarkDuplicates with these command: It worked! with argumnet "-ASO queryname"

    $ time ./gatk MarkDuplicates -I /home/cloud-user/MarkDuplicatesSpark/input/ERR194161_D0UYCACXX.4.unmapped.aligned.unsorted.bam -O /home/cloud-user/MarkDuplicatesSpark/output/marked_duplicates.bam -M /home/cloud-user/MarkDuplicatesSpark/output/marked_dup_metrics.txt -ASO queryname

    However, I want to benchmark MarkDuplicates Spark local runner with different cores.
    In MarkDuplicatesSpark, there is no -ASO argument.
    I remove it and run as follow:

    $ time ./gatk MarkDuplicates -I /home/cloud-user/MarkDuplicatesSpark/input/ERR194147_C0D8DACXX.4.unmapped.aligned.unsorted.bam -O /home/cloud-user/MarkDuplicatesSpark/output/marked_duplicates_non_spark.bam -M /home/cloud-user/MarkDuplicatesSpark/output/marked_dup_metrics_non_spark.txt

    It reported:
    [Fri Feb 01 21:49:35 EET 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 73.46 minutes.
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    picard.PicardException: This program requires input that are either coordinate or query sorted. Found unsorted
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:254)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

    These uBAM files should be already queryname sorted, why now without -ASO argument, MarkDuplicatesSpark complain it?

  • Angry_PandaAngry_Panda Member

    Hi @bhanuGandham
    Sorry my fault. This problem solved.
    It is a little bit silly but maybe easy to be ignored by green hand like me.

    Tool FastqToSam did transfered raw FASTQ file to uBAM file, and as tool default, it will be query (name) sorted. However, after following gatk4 data pre-processing steps, such as SamToFastq and bwa mem. The bam file become unsorted again.

    The trick is always use samtools to check the input sort order with command: ./samtools view -H input.bam

    Sorry for my upper silly question. I don´t know how could I delect it. thus I add information here.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin


    Its no problem at all! I am glad the problem is solved.

Sign In or Register to comment.