Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

MarkDuplicates: avoid excessive duplicate set size?


I am facing an issue that I do not understand using MarkDuplicates .

I have 2 bam files produced from mRNA .fastq files with the same protocol (GATK Best Practices):

  • file A is 20GB (HiSeq 2500); aligned with STAR; sorted by coordinates
  • file B is 10GB (HiSeq 4000), aligned with STAR; sorted by coordinates

Command line used for both files:
picard.sam.markduplicates.MarkDuplicates \
INPUT=[file.bam] \
OUTPUT=file_dedupe.bam \
METRICS_FILE=file_dedupe_metrics.txt \
PROGRAM_RECORD_ID=MarkDuplicates \
QUIET=false \

MarkDuplicates took about 12 hours on the largest file A, with a minimum and a maximum 'Large duplicate set. size' of 1,001 and 145,066 respectively.

For the smaller file B, it has been running for 1 week and it is far to be finished.
Everything ran fine until a monster set of size 19,220,000 appeared. The 'ReadEnds to keeper' step was very quick. But the 'ReadEnds to others' is taking about 11 minutes for processing 1,000 reads. If the rate is constant it should take not less than 132 days to complete this set !

I am running the program with 150GB memory and 24 cpus, set the -Xmx32G and a large temporary file. Increasing PIXEL_DISTANCE to 2500 does not change anything.

(1) Why a a smaller file would take forever to complete? Does it means I have a big "clump" of duplicates?
(2) Is there a way to apply a threshold to the duplicate set size?
(3) Is HiSeq 4000 flow cell not suppose to avoid optical duplicates? Is it safe to skip the optical duplicate detection when bam files will be used to detect variants and gene expression downstream?

Thanks !


Best Answer


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @user31888,

    (1) Why a smaller file would take forever to complete? Does it means I have a big "clump" of duplicates?

    Are you running MarkDuplicates on both files at the same time and is there enough unused cores on your system, e.g. 48 cores and 62G java heap size? When a process runs out of system memory, it spills to disk, which can really slow the process down (see this doc section). You can find your system's default maximum heap size with java -XX:+PrintFlagsFinal -version. Look for the line with MaxHeapSize.

    Such memory issues would especially be the case for RNA-Seq data where the duplicate set sizes can get huge for loci with high expression. For RNA-Seq, these "clumps of duplicates" as you describe do not necessarily represent duplicates in the sense that we intend MarkDuplicates for. However, marking duplicates shapes RNA-seq data to approximately resemble that of DNA data so that our tools, which have been mostly configured to handle DNA-Seq data, can process RNA-seq data efficiently for variant calling.

    Can you check that your BAM header indicates that the file is coordinate sorted? If it doesn't then be sure to use the ASSUME_SORT_ORDER=coordinate option and do not specify the ASSUME_SORTED parameter if you were previously doing so. In more recent releases of Picard, MarkDuplicates has been enhanced to take query-grouped BAMs. If MarkDuplicates thinks your BAM is query-group sorted, even though it is coordinate sorted, then, well, perhaps this could add to its memory use.

    Alternatively, given MarkDuplicates can now take query-grouped alignment files, well, it may be that this would require less memory and would finish sooner for you. This is in fact our new preferred approach that we now use in production as we outline in this workflow. There are other reasons that may be of interest to you to mark duplicates in this manner. To quote the doc:

    When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.

    If you decide to take the query-grouped approach, be sure to study the linked workflow for implications for MergeBamAlignment and SetNmUqAndMdTags, if you are using a similar approach with these tools. And let us know if there is an improvement in memory utilization.

    (2) Is there a way to apply a threshold to the duplicate set size?

    I'm not aware of any parameter that can change this. I think you are asking this so as to reduce the memory burden. What comes to mind is changing your paired end data, if it is indeed paired end, to single ended data to reduce the burden on memory. This will likely have the effect of (i) marking more duplicate reads (more sets with increased membership as definition of duplicate is no longer insert but rather read start and end) but also (ii) reduce the memory burden (reduce the genomic interval and thus the amount of records the tool must keep in memory). If you decide on such an approach, then you would want to carefully consider your DUPLICATE_SCORING_STRATEGY so as not to penalize reads with putative true variants, e.g. use the RANDOM strategy.

    (3) Is HiSeq 4000 flow cell not suppose to avoid optical duplicates? Is it safe to skip the optical duplicate detection when bam files will be used to detect variants and gene expression downstream?

    MarkDuplicates marks all types of duplicates, including optical duplicates. That is, MarkDuplicates, when flagging duplicates, does not distinguish the duplicate types and flags all duplicates. However, the library size estimation looks for optical duplicates within the duplicate pool. You can skip optical duplicate finding by setting READ_NAME_REGEX=null. Since estimating library complexity is a moot point for RNA-Seq expression data, skipping optical duplicate detection should have no impact for your study aims. So yes, totally safe to skip this.

    I hope this is helpful. Let me know if I can clarify anything or if you'd like me to look further into something that is important to your research. Happy holidays.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    P.S. @user31888, take a look at section 6.1 for an unsanctioned command to switch PE reads to SE reads.

Sign In or Register to comment.