MarkDuplicatesSpark is slower than normal MarkDuplicates

I was happy to hear that MarkDuplicatesSpark is out of beta as of version 4.1, so I tested it on one of our WES samples. Unfortunately, it took twice as long as the normal MarkDuplicates. Here are the two commands I used:

gatk MarkDuplicatesSpark -I '/media/Berechnungen/0028-19.recal.bam' -O '/media/Berechnungen/0028-19.dedup.bam' --metrics-file '/media/Berechnungen/0028-19.metrics' --spark-master local[4] --verbosity ERROR --tmp-dir /media/Ergebnisse/picardtmp/

gatk MarkDuplicates -I '/media/Berechnungen/0028-19.recal.bam' -O '/media/Berechnungen/0028-19.dedup.bam' -M '/media/Berechnungen/0028-19.metrics' --TMP_DIR /media/Ergebnisse/picardtmp/

Did I do something wrong in parallelization?

Best, Stefan


  • shlee (Cambridge) Member, Broadie, Moderator admin

    Hi @StefanDiederichMainz,

    Here are some questions from GATK developers that would help us determine what is going on.

    1. Can you try local[*] instead of local[4] and see what the wall-clock time is?
    2. What is your disk setup, e.g. are you using an NFS mount? Related to this, how many cores does your system have?
    3. Can you try sending the metrics file to -M /dev/null?
    4. Anything peculiar about your BAM? If the above does not shed light, then the developers ask if you would be able to share the entirety of your BAM file towards figuring out what may be going on. Directions for sharing data are at

    Thanks for testing out MarkDuplicatesSpark.

  • Hi @shlee,

    thanks for your quick answer.
    1. I did some wall-clock time measurements
    MarkDuplicates: 15 min 55 sec
    MarkDuplicatesSpark local[4]: 28 min 26 sec
    MarkDuplicatesSpark local[8]: 17 min 49 sec
    MarkDuplicatesSpark local[*]: 11 min 00 sec
    2. We are using an SSD that is directly attached to the server, so there is no NFS mount.
    3. Redirecting -M to /dev/null does not change the wall-clock time (local[8] with -M /dev/null needs 17 min 24 sec).
    4. I tried to upload the file but had some problems. If you need the file I will give it some more tries or will upload it to our file server and send you a link for downloading...

    Using all available cores (36), it seems to be faster than the normal MarkDuplicates. But I cannot always use all the available cores, because several users share this server.

    Do you see a similar behaviour in your environment?
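    Laid out as a quick script (core counts and times taken straight from the measurements above; local[*] used all 36 cores, as noted), the scaling looks like this:

```python
# Wall-clock times reported above, in seconds.
baseline = 15 * 60 + 55           # non-Spark MarkDuplicates

spark_runs = {                    # MarkDuplicatesSpark, keyed by core count;
    4: 28 * 60 + 26,              # local[*] ran on all 36 cores
    8: 17 * 60 + 49,
    36: 11 * 60,
}

for cores, t in sorted(spark_runs.items()):
    vs_nonspark = baseline / t    # > 1 means faster than MarkDuplicates
    print(f"local[{cores}]: {t} s, {vs_nonspark:.2f}x vs non-Spark")
```

    Only the 36-core run beats the non-Spark baseline (about 1.45x); the 4- and 8-core Spark runs are slower, which matches the observation above.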


  • shlee (Cambridge) Member, Broadie, Moderator admin

    Thanks for the follow-up @StefanDiederichMainz. Needing to scatter across 36 cores to outperform the non-Spark tool seems a bit much. However, the developers suspect that writing the metrics file is the bottleneck, so we have one more thing for you to control for: the writing of the metrics file. Although MarkDuplicates requires the metrics file, MarkDuplicatesSpark does not; a separate standalone tool, EstimateLibraryComplexity, collects the exact same metrics. The developers think that if you omit the metrics file, MarkDuplicatesSpark will perform efficiently, as expected. To get comparable results, you can run MarkDuplicatesSpark without the metrics and EstimateLibraryComplexity concurrently. Are the metrics important to your pipeline?

  • I do not need the metrics file in my pipeline, but I did not know that the Spark tool does not require this option. I tested it without the -M option:

    • local[4]: 21 min 29 sec
    • local[8]: 13 min 38 sec
    • local[*]: 09 min 08 sec

    The Spark tool is faster without writing a metrics file, but I think the developers hoped to see a bigger effect...
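    Comparing these against the earlier with-metrics runs, the saving from skipping the metrics file works out to roughly 17-24%:

```python
# Wall-clock times in seconds, with and without the metrics file,
# taken from the measurements reported in this thread.
with_metrics = {"4": 28 * 60 + 26, "8": 17 * 60 + 49, "*": 11 * 60}
no_metrics = {"4": 21 * 60 + 29, "8": 13 * 60 + 38, "*": 9 * 60 + 8}

for k in with_metrics:
    saved = with_metrics[k] - no_metrics[k]
    pct = 100 * saved / with_metrics[k]
    print(f"local[{k}]: {saved} s saved ({pct:.0f}%)")
```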

  • shlee (Cambridge) Member, Broadie, Moderator admin

    Thanks, @StefanDiederichMainz, for testing the sans-metrics case. I think the last thing for us to test is your BAM file. Was there a particular error when you tried to upload to our FTP site? If there is any way you can share the data safely, could you please share it with us? You can direct message me within the forum if you need.

  • mack812 (Spain) Member

    Shouldn't MarkDuplicatesSpark be compared to MarkDuplicates + SortSam, as shown in the following link?

    The output of MDSpark is coordinate sorted.

  • StefanDiederichMainz (Mainz) Member
    edited February 13

    I uploaded the file to a fileshare folder of the University of Mainz and sent you a PM with the link and login data.
    One other small thing I found out is that MDSpark does not accept commas in filenames, even though I quoted the file path. Is there any way to allow commas in file paths, or do I have to avoid them?

    I tested MDSpark with an unsorted SAM file as input and you are right: it works, and the output is a sorted BAM file. So I can skip the SortSam tool, which will save me about 15-20 min of processing time. In total, MDSpark is faster than SortSam + MD + indexing.
    Thanks for that hint!

  • mack812 (Spain) Member

    Great! Glad to hear that the difference is substantial in your settings @StefanDiederichMainz

    In my case MDSpark took nearly an hour and a half for a large WES BAM (around 140 million reads, 17 GB in size), running on GCP with 16 CPUs and 64 GiB RAM (it failed with 16 GiB or 32 GiB RAM). I was expecting a shorter time, but I might be wrong.

    @shlee, is there any way to tweak the runtime parameters and/or java options to make MDSpark run optimally?


  • shlee (Cambridge) Member, Broadie, Moderator admin

    Hi @StefanDiederichMainz and @mack812,

    Stefan--we've received your information and a developer will be looking into your case. We will either have some thoughts for you today or sometime after next week, as the developer is unavailable for the next week.

    @mack812, it's been a long while since I thought about Spark. You might find the process I outline in of interest towards monitoring how a run's workers consume resources (CPU, I/O, etc.) and how to ensure work is distributed equally among workers instead of just to a single worker, as appears to happen in the Spark run illustrated in the tutorial.

  • LouisB (Broad Institute) Member, Broadie, Dev ✭✭

    @mack812 That's definitely not what we'd expect. I would expect it to be both much faster and to require less memory. Can you post your settings? Are you using a local SSD on your cloud machine? We found very poor performance when running MDSpark on persistent disk, either HDD or SSD, on Google Cloud.

  • mack812 (Spain) Member

    Hi @LouisB,

    Thanks for replying.

    The run length that I mentioned in my previous post (an hour and a half) is from VM startup to the time the .bam output appeared in the bucket; a good portion of that is I/O, I imagine. According to the tool's log, which I am attaching, the computing elapsed time was 73.73 minutes (sorry for the confusion). If I understand the log correctly, the dup-marking task took 38 minutes, while the rest of the time (35 min) went to merging files, generating the BAM indexes (sbi and bai), and, I guess, sorting the file by genomic coordinates.

    I am also attaching the .sh script generated by the GCP backend, in which you can see that the input was composed of 10 aligned BAMs generated by a scattered run of bwa mem (I split the 2 starting fastqs into 10 pairs, which made the alignment really fast). I do not know if taking multiple BAMs as input instead of just one file could be affecting the expected run time, but I guess that this is the most common situation.

    One additional thing, in case it could also be interfering with the expected running time: as you can see in the .sh script, I did not use the optical-distance and regex options. The fastqs I am working with for these trials are from public repositories (i.e. SRA). SRA-deposited fastqs have their read names transformed into SRR(number).1/SRR(number).2, erasing the flowcell-lane-cluster position from the read ID and therefore making the marking of optical duplicates useless. In any case, according to the metrics report I generated with the EstimateLibraryComplexity tool for the dup-marked BAM, the fraction of PCR duplicates is high in this BAM (common in FFPE samples...): 34.5%. Maybe that also made the tool run slower than expected.

    Regarding the runtime parameters, this is the task configuration (I am hard-coding the runtime parameter values below):

          call MarkDuplicatesSpark {
            input:
              input_bams = input_bams,
              output_bam_basename = base_file_name + ".aligned.sorted.duplicates_marked",
              total_input_size = total_input_size,
              compression_level = compression_level,
              gatk_docker = gatk_docker,
              disk_pad = disk_pad,
              preemptible_tries = agg_preemptible_tries
          }

          task MarkDuplicatesSpark {
            Array[File] input_bams
            String output_bam_basename
            Float total_input_size
            Int compression_level
            Int preemptible_tries
            String gatk_docker
            Int disk_pad
            Float md_disk_multiplier = 3.25
            Int disk_size = ceil(md_disk_multiplier * total_input_size) + disk_pad
            String? read_name_regex
            Int? optical_distance
            Int? mem
            Int machine_mem = if defined(mem) then mem * 1000 else 64000
            Int command_mem = machine_mem - 2000
            Int? cpu

            command {
              gatk --java-options "-Dsamjdk.compression_level=2 -Xms62000m" \
                MarkDuplicatesSpark \
                  -I ${sep=' -I ' input_bams} \
                  -O ${output_bam_basename}.bam \
                  ${"--read-name-regex " + read_name_regex} \
                  ${"--optical-duplicate-pixel-distance " + optical_distance} \
                  -VS SILENT \
                  -- --spark-master 'local[*]'
            }
            runtime {
              docker: ""
              cpu: 16
              preemptible: 10
              memory: "64 GB"
              disks: "local-disk 62 HDD"
            }
            output {
              File output_bam = "${output_bam_basename}.bam"
              File output_bai = "${output_bam_basename}.bam.bai"
            }
          }
    Also, there was an error in my WDL script: it asked the task to output a ".bai" index instead of the ".bam.bai" that the tool actually creates (already corrected above), which explains the final error in the tool's log.

    Prior to this run, I had two failed runs when setting the runtime memory to 16 GB or 32 GB. The tool stopped suddenly very early in the run and threw a strange error about not being able to write to my bucket because it had the "Requester Pays" feature ON (it was OFF both times). I did not keep those logs.

    Sorry if I am missing something very obvious here. Thanks for your help.

  • emeryj Member, Broadie
    edited February 13

    @StefanDiederichMainz I have taken a look at your BAM, and I would note that your input is currently coordinate-sorted. MarkDuplicatesSpark is optimized for inputs that are either queryname-sorted or query-grouped, as it needs to group read pairs together. To get around this, MarkDuplicatesSpark first sorts any input that isn't grouped by read name, then proceeds to mark duplicates as normal. I suspect this explains your observation that MarkDuplicatesSpark is slower than advertised. MarkDuplicates is run immediately after mapping in our pipeline, so the reads at that stage are typically grouped by read name; consequently we haven't looked into optimizing the coordinate-sorted use case. It is worth noting that coordinate-sorted input also affects Picard MarkDuplicates, which won't mark secondary or supplementary reads at all in that case. Since Picard doesn't explicitly group reads into their pairs until a later stage in the process, it doesn't take the same input-sorting performance hit that MarkDuplicatesSpark does.

    @mack812 I would love to see the logs/failures from your unsuccessful runs of MarkDuplicatesSpark with less memory; I would like to diagnose your memory issues. Our testing found that most 30x WGS BAMs ran comfortably in 20 GB of RAM or less, though memory usage scales with library complexity, and 34.5% duplication is high, so it's possible that is the source of your issues. I wouldn't expect the change in the --read-name-regex field to significantly impact the runtime compared to Picard MarkDuplicates.

    The primary issue I see with your WDL is that you are using disks: "local-disk 62 HDD". We found that MarkDuplicatesSpark requires a low-latency disk to run efficiently (indeed, most of our Spark tools do). I would recommend changing that line to disks: "local-disk 62 LOCAL", which requests an SSD that is physically on the same rack as your CPU and thus has significantly higher throughput than HDD and SSD, which are both cloud persistent disks that require network I/O behind the scenes. You can read more about the disk pricing/properties here: If you are on a new version of Cromwell, then requesting over 375 GB of local SSD will automatically request enough local disks to meet your needs. You should also consider adding --conf 'spark.local.dir=./tmp' to your gatk command. This ensures that Spark dumps its temp files onto a fast disk (the main local-disk that you requested) and not the boot disk (which is guaranteed to be both slow and small).

  • LouisB (Broad Institute) Member, Broadie, Dev ✭✭
    edited February 13

    A further note on disk speed. Google Cloud disk speed is complex and unintuitive: throughput for persistent disk is proportional to the disk size. I believe the recommendation is that 1 TB of HDD is about the same speed as a physical HDD, so a 62 GB disk will be limited to something like 6% of a regular disk's throughput, which is extremely slow. It's not totally linear and depends on the machine size as well, but I think the recommendation is not to use anything smaller than 1 TB of HDD for any disk-intensive operation. See this page for more information
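    As a rough sketch of that scaling (the 0.12 MB/s-per-provisioned-GB sustained-throughput figure for pd-standard is my reading of the GCP disk-performance docs and should be treated as an assumption; per-VM caps also apply, so this is illustrative only):

```python
# Persistent-disk (pd-standard, HDD) throughput scales with disk size.
# ASSUMPTION: ~0.12 MB/s of sustained throughput per provisioned GB,
# ignoring the per-VM caps that kick in on large disks/machines.
MB_PER_S_PER_GB = 0.12

def pd_hdd_throughput_mb_s(size_gb):
    """Approximate sustained throughput of a pd-standard disk."""
    return MB_PER_S_PER_GB * size_gb

print(pd_hdd_throughput_mb_s(1000))  # ~120 MB/s, roughly a physical HDD
print(pd_hdd_throughput_mb_s(62))    # ~7.4 MB/s, about 6% of that
```

    The 62 GB disk in the WDL above would therefore get only a few MB/s of sustained throughput, which is consistent with the ~6% figure.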

    In any case, we definitely recommend using LOCAL SSDs. They're more expensive, but the speedup is usually worth it. You also get a discount on LOCAL disks when running preemptible machines, while persistent disk remains the same cost.

  • @shlee and @emeryj
    thanks for all your useful help and discussion. I am now using MDSpark directly after mapping with bwa mem and can see a clear time reduction and advantage compared to the non-Spark MD.
    Thanks again to everyone for helping out with this

  • mack812mack812 SpainMember

    Thanks so much @emeryj and @LouisB for this useful information; I will take it into account. I was trying to reduce expenses by sizing the disk space dynamically, as many of the WDLs in the Best Practices workflows do (like the five-dollar genome one), but I will switch to local disks for the Spark tools following your advice (thanks!).

    I just re-ran with exactly the same settings, but with the local-disk configuration in the runtime section of the task and the extra --conf 'spark.local.dir=./tmp' argument. A "Local SSD scratch disk" 375 GB in size was available "inside" the VM console window. Overall there was a substantial improvement: only 47.63 minutes of elapsed time now (nearly 1 hour from VM creation to output). Reading the log, it seems that the "first part" took only 11 minutes this time (wow!), while the "second part" (merging files, etc.) did not improve and remained at 37 minutes. I am attaching the tool's log.

    I also did another run with less RAM (52 GB), but the MDSpark tool failed again. This time it simply got stuck during the "first part" (I am also attaching the tool's log from this run). The log did not move past that last line. After the run had been stuck there for 10 minutes, I decided to abort it. No error was prompted when I aborted; it just "froze" at that point.

    I have noticed that one of the first lines of the 52 GB run's log says "INFO MemoryStore: MemoryStore started with capacity 28.5 GB", while with 64 GB RAM the same line reads "INFO MemoryStore: MemoryStore started with capacity 34.6 GB". It seems the Spark engine only uses around 54% of the RAM available in the VM (even though I am setting the JVM -Xms heap size at machine mem minus 2 GB), and this particular BAM needs somewhere between 28.5 and 34.6 GB of RAM for the MDSpark run to succeed. Can the memory available to Spark be increased by adding another "--conf" line?
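    For what it's worth, the ~54% figure roughly matches Spark's memory model as I understand it: the region the MemoryStore line reports is approximately (heap minus ~300 MB reserved) times spark.memory.fraction, whose default is 0.6 and which can in principle be raised with another --conf, at the cost of headroom for JVM overhead. A sketch of the arithmetic, with the 300 MB reservation and the 0.6 default taken as assumptions from the Spark tuning docs:

```python
# Approximate size of Spark's unified memory region (what the
# "MemoryStore started with capacity ..." log line reports).
# ASSUMPTIONS from the Spark tuning docs: ~300 MB of reserved memory
# and spark.memory.fraction defaulting to 0.6.
RESERVED_MB = 300
MEMORY_FRACTION = 0.6

def unified_region_gb(heap_mb):
    """Unified (storage + execution) memory for a given JVM heap, in GB."""
    return (heap_mb - RESERVED_MB) * MEMORY_FRACTION / 1024

print(round(unified_region_gb(62000), 1))  # -Xms62000m -> ~36 GB region
print(round(unified_region_gb(50000), 1))  # 52 GB machine minus 2 GB
```

    The computed values come out slightly above the logged 34.6 GB and 28.5 GB, presumably because the JVM reports usable heap a bit below -Xms and the log uses binary (GiB-like) units, but the ~55-60% ratio is the same.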

