MarkDuplicates does not finish within 48 hours

I've been trying to run MarkDuplicates on a BAM file produced by MergeBamAlignment from a paired-end RNA-seq dataset. However, the program did not finish even after running for 48 hours on the Stampede supercomputer. The BAM file is 6.5 GB. Below is the command I used:

$WORK/tools/jre1.8.0_91/bin/java -Xmx128G -jar $WORK/tools/picard-tools-2.4.1/picard.jar MarkDuplicates \
INPUT=$WORK/GATK/XHD1/XHD1-MergeBamAlignment.bam OUTPUT=$WORK/GATK/XHD1/XHD1_markduplicates.bam \

After running the program on per-chromosome BAM files, I found that the reads mapped to chr 17 are what bog down the program. The output shows messages like the following:
INFO 2016-07-18 17:10:24 OpticalDuplicateFinder compared 37,000 ReadEnds to others. Elapsed time: 00:26:52s. Time for last 1,000: 42s. Last read position: 0:7,276

It appears there are several extremely large sets of duplicates mapped to chr 17.
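
In case it helps with diagnosis, here is a minimal sketch for reducing the OpticalDuplicateFinder progress lines to two columns: the number of ReadEnds compared so far and the seconds taken by the last batch. The function name parse_odf_progress is made up for illustration, and the pattern assumes the log lines look like the one quoted above.

```shell
# Hypothetical helper: read a MarkDuplicates log on stdin and print
# "<ReadEnds compared> <seconds for last batch>" for each progress line.
# Assumes the message format shown in the log excerpt above.
parse_odf_progress() {
  sed -n -E 's/.*OpticalDuplicateFinder compared ([0-9,]+) ReadEnds to others\..*Time for last [0-9,]+: +([0-9,]+)s\..*/\1 \2/p' \
    | tr -d ','   # drop thousands separators so the columns are plain integers
}
```

It could be run as `parse_odf_progress < markduplicates.log`, where the log file name is only a placeholder. A second column that keeps growing would point to one very large duplicate set rather than a uniformly slow run.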

Are there any solutions to this problem? Any help will be highly appreciated.


