Is the mark-duplicates step needed for pooled samples?

Sathya (India, Member)


I am working with pooled RNA-seq samples and have a question about marking duplicate reads.

From the literature and the manuals of SNP-calling tools, I understand that the step after mapping is to mark duplicate reads. Because my samples are pooled, I have very high sequence duplication, so I used Picard to remove the duplicates and then called SNPs.

I am only interested in SNPs in 10 genes. With duplicates removed, I found nearly 30 SNPs in those genes.

I then called SNPs again, this time skipping the mark-duplicates step. For the same 10 genes I found nearly 60 SNPs.

When I compared the two sets, all 30 of the earlier SNPs were among the 60, with higher SNP quality and read depth.
I am unsure whether the mark-duplicates step is needed in my case. Examples are given below. Please advise which approach is correct.

SNPs found after marking duplicates (position, reference allele, alternate allele, quality, depth):

153333117 C A 249.43 DP=12
153333354 C T 49.68 DP=27
74606669 T G 62.62 DP=3

SNPs found when skipping the mark-duplicates step (position, reference allele, alternate allele, quality, depth):

153333117 C A 1496.54 DP=56
153333354 C T 105.08 DP=62
74606669 T G 425.86 DP=15
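
The two call sets listed above can be compared site by site. The following is a minimal Python sketch, assuming the columns are position, reference allele, alternate allele, quality, and depth; the dictionary names and the `compare_calls` helper are hypothetical and only illustrate the comparison described in the post.

```python
# Calls copied from the post, keyed by (position, ref, alt) -> (QUAL, DP).
with_markdup = {
    (153333117, "C", "A"): (249.43, 12),
    (153333354, "C", "T"): (49.68, 27),
    (74606669, "T", "G"): (62.62, 3),
}
without_markdup = {
    (153333117, "C", "A"): (1496.54, 56),
    (153333354, "C", "T"): (105.08, 62),
    (74606669, "T", "G"): (425.86, 15),
}


def compare_calls(dedup, raw):
    """For sites present in both call sets, report how much depth and
    quality increase when duplicates are NOT removed (hypothetical helper)."""
    rows = []
    for site in sorted(dedup.keys() & raw.keys()):
        qual_dedup, dp_dedup = dedup[site]
        qual_raw, dp_raw = raw[site]
        rows.append({
            "site": site,
            "dp_ratio": dp_raw / dp_dedup,
            "qual_ratio": qual_raw / qual_dedup,
        })
    return rows


rows = compare_calls(with_markdup, without_markdup)
for row in rows:
    pos, ref, alt = row["site"]
    print(f"{pos} {ref}>{alt}  DP x{row['dp_ratio']:.1f}  QUAL x{row['qual_ratio']:.1f}")
```

A large depth ratio at a site means most of the supporting reads there were flagged as duplicates; the comparison alone cannot tell whether those reads were true PCR duplicates or independent fragments.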

Thanks in advance.

Best Answer


  • Sathya (India, Member)
    edited March 2014

    Thanks for your suggestion.

  • CarlosBorroto (Member)

    Hi @Sathya‌ ,

    I noticed you mentioned that you are working with RNA samples. In that case, current mark-duplicates algorithms might not do a good job of finding true duplicates. Many genes are oversampled simply because of their expression level. You will have many reads starting right at the transcription start site and possibly of the same length. These reads might come from several different transcripts, so technically they aren't duplicates.

    See more detailed discussions in this link:

    Hope it helps,


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    That's a good point, thanks for jumping in, Carlos. We don't have much experience with RNAseq yet so this is very helpful.
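
The point raised in this thread about RNA-seq duplicates can be illustrated with a toy example. The sketch below assumes a deliberately simplified rule: reads that share chromosome, strand, and 5' start position are treated as duplicates, and the first one seen is kept. This is not Picard's actual implementation, which considers more information (for example mate position and base qualities); the `flag_duplicates` function and the read tuples are hypothetical.

```python
def flag_duplicates(reads):
    """Flag every read whose (chrom, strand, start) key was already seen.

    Simplified, position-only rule for illustration; real duplicate
    marking tools use additional criteria.
    """
    seen = set()
    flags = []
    for chrom, strand, start in reads:
        key = (chrom, strand, start)
        flags.append(key in seen)  # True means "flagged as duplicate"
        seen.add(key)
    return flags


# Hypothetical reads from a highly expressed gene: five reads that come
# from different transcript molecules but start at the same position,
# plus one read starting elsewhere.
reads = [("chr1", "+", 1000)] * 5 + [("chr1", "+", 1040)]

flags = flag_duplicates(reads)
print(sum(flags), "of", len(reads), "reads flagged as duplicates")
```

Under this position-only rule, independent reads from a highly expressed gene are indistinguishable from PCR duplicates, which is consistent with the large drop in depth reported in the original post once duplicates were removed.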
