skipping MarkDuplicates in deep sequencing data

Hi GATK team,
Recently I have been working with deep sequencing data (target capture data, about 200x depth; the sorted BAM file is about 3.39G in size), but I am stuck at Picard MarkDuplicates: it has taken 85 hours so far and I expect it will take even longer. Can I skip MarkDuplicates? I would also like to know whether skipping it would have a large effect on somatic mutation calling with MuTect.



  • xiaolongge, China (Member)

    Actually, the mean depth of the target capture data is about 20000x, not the 200x stated above.

  • Sheila, Broad Institute (Member, Broadie admin)


    We don't recommend skipping the MarkDuplicates step because duplicates that contain errors can cause false positives.

    Have a look at this section and this section of the MarkDuplicates tutorial for more tips.


  • xiaolongge, China (Member)

    Thanks for your reply. I know MarkDuplicates can largely control false positives. I just can't believe its runtime, even though I set -Xmx#G and TMP_DIR, and even -XX:ParallelGCThreads to add threads. Is it because of the numerous duplicate reads in my data (74,635,747 paired-end reads, about 69.6% duplication rate)?

    One more thing: in deep sequencing, a genomic region can be sequenced multiple times, sometimes hundreds or even thousands of times, which results in lots of duplicates. However, MarkDuplicates tags these duplicates as artifacts, and those duplicate reads no longer count even as evidence of coverage depth. I can't understand that.
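
The figures quoted in this post can be put in perspective with a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes the reported 69.6% duplication rate applies uniformly to the 74,635,747 reads, and the variable names are illustrative rather than anything produced by Picard.

```python
# Figures quoted in the post; everything else here is illustrative.
total_reads = 74_635_747      # reads reported by the poster
duplication_rate = 0.696      # "about 69.6% duplication rate"

# Approximate number of reads that would be flagged as duplicates.
duplicate_reads = round(total_reads * duplication_rate)

# Reads left to contribute to coverage once duplicates are ignored.
non_duplicate_reads = total_reads - duplicate_reads

print(f"flagged as duplicates: {duplicate_reads:,}")
print(f"remaining reads:       {non_duplicate_reads:,}")
```

Under these assumptions, more than two thirds of the reads would stop counting toward coverage after duplicate marking, which is the effect the poster is asking about.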


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    If you're using an amplicon-based deep sequencing design, then we don't recommend marking duplicates.
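
A toy illustration may help show why duplicate marking behaves so differently for amplicon data. The sketch below is not Picard's algorithm (the real tool works with unclipped 5' positions of both mates, among other things); it is a simplified stand-in that groups reads by chromosome, start position, and strand, keeps the read with the highest summed base quality in each group, and flags the rest. The function `mark_duplicates` and both read sets are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Toy position-based duplicate marking (NOT Picard's implementation).

    reads: list of dicts with keys name, chrom, start, strand, qual_sum.
    Returns the set of read names flagged as duplicates.
    """
    groups = defaultdict(list)
    for read in reads:
        # Reads sharing chromosome, start, and strand are treated as copies.
        groups[(read["chrom"], read["start"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality; flag the rest.
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Hypothetical amplicon data: every read starts at the same primer site.
amplicon_reads = [
    {"name": f"amp{i}", "chrom": "chr1", "start": 1000, "strand": "+",
     "qual_sum": 3000 + i}
    for i in range(1000)
]

# Hypothetical randomly fragmented data: start positions vary.
fragmented_reads = [
    {"name": f"frag{i}", "chrom": "chr1", "start": 1000 + i, "strand": "+",
     "qual_sum": 3000 + i}
    for i in range(1000)
]

amplicon_dups = mark_duplicates(amplicon_reads)
fragmented_dups = mark_duplicates(fragmented_reads)
print(len(amplicon_dups), "of", len(amplicon_reads), "amplicon reads flagged")
print(len(fragmented_dups), "of", len(fragmented_reads), "fragmented reads flagged")
```

In this toy setup nearly all amplicon reads are flagged simply because they share a start position by design, so position-based marking would discard almost all of the real coverage.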
