
skipping MarkDuplicates in deep sequencing data

Hi GATK team,
Recently I have been working with deep sequencing data (target capture data, about 200x depth; the sorted BAM file is about 3.39 GB), but I am stuck at Picard MarkDuplicates. It has been running for 85 hours so far and I expect it will take even longer. Can I skip MarkDuplicates? I would also like to know whether skipping it would have a big effect on somatic mutation calling with MuTect.



  • xiaolongge (china; Member)

    Actually, the mean depth of the target capture data is about 20000x, not the 200x I stated above.

  • Sheila (Broad Institute; Member, Broadie)


    We don't recommend skipping the MarkDuplicates step because duplicates that contain errors can cause false positives.

    Have a look at this section and this section of the MarkDuplicates tutorial for more tips.


  • xiaolongge (china; Member)

    Thanks for your reply. I know MarkDuplicates can largely control false positives. I just can't believe its runtime, even though I set -Xmx#G and TMP_DIR, and even -XX:ParallelGCThreads to add threads. Is it because of the large number of duplicate reads in my data (74,635,747 paired-end reads, about 69.6% duplication rate)?

    One more thing: in deep sequencing, a genomic region can be sequenced many times, sometimes hundreds or even thousands of times, which results in lots of duplicates. However, MarkDuplicates tags these duplicates as artifacts, so those reads no longer count, even as evidence of coverage depth. I can't understand that.
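
For a sense of scale, the figures quoted in this thread can be turned into a rough estimate of what remains once duplicates are ignored. This is only a back-of-the-envelope sketch: it assumes the 69.6% rate applies evenly to every read and every target, which real data will not do exactly, and the function name `after_dedup` is made up for illustration.

```python
def after_dedup(total_reads, mean_depth, dup_rate):
    """Estimate reads and mean depth left once duplicates are ignored.

    Simplifying assumption: duplicates are spread evenly across all targets,
    so depth shrinks by the same fraction as the read count.
    """
    keep = 1.0 - dup_rate
    return round(total_reads * keep), mean_depth * keep


# Figures quoted in the thread: 74,635,747 reads, ~20000x depth, 69.6% duplication.
reads_left, depth_left = after_dedup(74_635_747, 20_000, 0.696)
print(f"{reads_left:,} reads, roughly {depth_left:,.0f}x mean depth")
```

Under that assumption, a large share of the nominal depth comes from reads that duplicate marking would set aside.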


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    If you're using an amplicon-based deep sequencing design, then we don't recommend marking duplicates.
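
One way to see why amplicon data is a special case: position-based duplicate marking treats read pairs with the same alignment coordinates and orientation as copies of one original fragment. In an amplicon design the fragment ends are fixed by the primers, so reads from independent molecules look identical by that criterion. The toy sketch below illustrates the idea only; it is not Picard's implementation, and the coordinates and the `count_marked` helper are hypothetical.

```python
from collections import defaultdict


def count_marked(read_pairs):
    """Toy duplicate marking: pairs with the same key form one group.

    One pair per group is kept; the rest are counted as duplicates.
    Returns (kept, marked_as_duplicate).
    """
    groups = defaultdict(int)
    for chrom, start, end, strand in read_pairs:
        groups[(chrom, start, end, strand)] += 1
    kept = len(groups)
    return kept, len(read_pairs) - kept


# Hypothetical amplicon: every fragment spans the same primer-defined interval.
amplicon = [("chr7", 1000, 1150, "+")] * 1000

# Hypothetical sheared fragments: random breakpoints give varied ends.
sheared = [("chr7", 1000 + i, 1150 + 2 * i, "+") for i in range(1000)]

print(count_marked(amplicon))
print(count_marked(sheared))
```

In the amplicon case almost every read is flagged even though the reads may come from distinct molecules, so marking duplicates would discard real evidence.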
