
Is removing duplicates appropriate with pooled data?

jesstilla (Medford, MA, Member)

Hello, I am a graduate student in a lab that studies evolution, and I am relatively new to NGS. I have been given reads from pooled moth samples, and I am hoping to identify variants with the ultimate goal of quantifying the genetic differentiation between two strains of moths. I am wondering 1) if it is appropriate/recommended to remove duplicates with pooled data, and 2) more broadly, whether there are particular situations in which removing duplicates is not suggested. For example, I have another data set in which the fragments were generated not by random shearing but by multiplex PCR of 17 particular amplicons for 42 different individual moths (not pooled). I'm guessing that removing duplicates doesn't make sense in this case because there will be lots of reads that start at the exact same position relative to the reference. Is this right?

Thanks a bunch!



  • Carneiro (Charlestown, MA, Member, admin)

    No, we do not recommend removing duplicates in pooled data or in multiplexed PCR.

  • jesstilla (Medford, MA, Member)

    Thanks for your answer! I assume this is because I might remove duplicates that are from different individuals (instead of the result of PCR duplication) but if the fragments were generated by random shearing, isn't it fairly unlikely that I will happen to have two fragments from different individuals with the same starting position relative to the reference? I.e. doesn't the same logic apply? Or am I missing something?
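One way to sanity-check that intuition is a birthday-problem calculation. The numbers below (locus size, coverage, read length) are purely illustrative assumptions, not from this thread, but they show that at typical sequencing depths, start-position collisions between genuinely independent fragments are common even with random shearing.

```python
import math

def expected_starts_per_position(coverage, read_len):
    # With random shearing, start positions are roughly uniform, so about
    # coverage / read_len reads begin at each reference base on average.
    return coverage / read_len

def collision_probability(n_reads, n_positions):
    # Birthday-problem approximation: probability that at least two
    # independent fragments share the same start position.
    return 1.0 - math.exp(-n_reads * (n_reads - 1) / (2.0 * n_positions))

# Illustrative assumption: a 10 kb locus at 100x coverage with 100 bp reads.
n_positions = 10_000
n_reads = 100 * n_positions // 100  # 10,000 reads
print(expected_starts_per_position(100, 100))       # on average 1 read starts per base
print(collision_probability(n_reads, n_positions))  # effectively 1: collisions are certain
```

So at high depth, identical start positions are expected even between truly independent molecules from different individuals, which is why deduplication would wrongly discard real signal in a pool.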

  • Eva (Member)

    Dear Carneiro, when it comes to pooled samples, is there any recommended way to differentiate between PCR duplicates and real fragments?

  • AlexanderV (Berlin, Member)

    The question about recalibration with multiplexed samples was discussed a lot here:

    But I could not find much about duplicate marking.
    This thread clearly states (and it makes sense) not to run deduplication on pooled samples.
    But what about demultiplexed samples from a pooled/multiplexed run?

    So, let's say I have 10 samples in one lane, demultiplex them, and get 10 files, one per sample. Would it be bad to run deduplication on each of them? Why or why not?

  • Geraldine_VdAuwera (Cambridge, MA, Member, Administrator, Broadie, admin)

    @AlexanderV There is no problem with dedupping samples that were multiplexed -- that is what you should do. The warning is against dedupping pooled samples that are not individually barcoded, which is a different usage of the term.

  • AlexanderV (Berlin, Member)

    @Geraldine_VdAuwera I missed your answer and just read it now.
    You mean that running deduplication on the individual alignments (demultiplexed, one file per sample) is fine?

    On the other hand: when the data was generated with RADseq, all the reads start at the same position, because it is determined by the enzyme used. I get many real duplicates, because a lot of DNA is used. So in this case, is it again a bad idea to use deduplication?
    In fact, I get read distributions at a locus of roughly 40/60%: 40% of reads for allele 1, 60% for allele 2. On SNP calling, some of these sites don't get called because it deviates too much. I assume an imbalance was created during the PCR step, which in turn would argue for using deduplication.
    So - what to do?

  • Sheila (Broad Institute, Member, Broadie, admin)



    Yes to your first question.

    I am not sure I understand what you mean by "On SNP calling, some of these sites don't get called because it deviates too much."

    Probably the best thing to do here is to compare the results of marking duplicates and not marking duplicates and decide which gives better results.


  • AlexanderV (Berlin, Member)
    edited July 2015

    I ran deduplication and found this (see attached picture).
    The data is, as expected, pretty messed up, because MarkDuplicates also removes/marks genuinely independent molecules, not just the copies from PCR (rhetorically: how could it tell them apart?).

    Reminder: the reads don't come from random shearing of the source DNA, but from digestion with an enzyme, which has a specific cut site.

    Still, there are some PCR duplicates in there that create an imbalance.
    What I mean is:
    E.g. we have 15 real molecules from digestion at this position: 8 with allele 1, 7 with allele 2.
    It can deviate only slightly from 50/50 because we have polyploidy; a small difference might still be there because we're talking about digestion, a biological process.
    It gets PCRed for a few rounds, which might leave us with 60 pieces.
    And after sequencing this mixture, we get 38 with allele 1 and 22 with allele 2.
    Because 63% of reads support one allele and 37% the other, that is quite a bit away from the peak probability at 50/50 for a het call.

    So: this imbalance/deviation that was created by PCR can't be reduced with MarkDuplicates. Is there another way to approach this? Maybe tweaking some parameter during genotyping, to allow the model a broader deviation from the theoretical 50/50 distribution for het calls?

    I hope I was able to explain my problem better. Otherwise please ask, and I will try to clarify.
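To put rough numbers on how far 38/60 sits from a het site's 50/50 expectation, here is a quick binomial tail calculation using the counts from the post. This is only a sketch: real callers such as HaplotypeCaller use genotype likelihoods rather than a simple binomial test.

```python
from math import comb

def binom_tail(n, k, p=0.5):
    # P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    # reads for one allele at a truly heterozygous (50/50) site.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Numbers from the post: 38 of 60 reads support allele 1.
tail = binom_tail(60, 38)
print(tail)  # a few percent: unusual for a het site, but far from impossible
```

Note that PCR jackpotting makes the observed counts overdispersed relative to this binomial model, and that extra variance is exactly what dedupping would normally remove.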

  • Geraldine_VdAuwera (Cambridge, MA, Member, Administrator, Broadie, admin)

    Unfortunately deduplication is not appropriate for RADseq data because the way the data is generated (enzymatic digests at specific sites) confuses the algorithm. On this type of data it's better not to dedup.
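A minimal sketch of why coordinate-based duplicate marking breaks down on RADseq. This mimics the grouping idea behind tools like Picard MarkDuplicates, not their actual implementation, and the read records are hypothetical:

```python
from collections import defaultdict

def mark_duplicates(reads):
    # Group reads by (chrom, start, strand) and flag all but the first in
    # each group as duplicates, mimicking coordinate-based dedupping.
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["strand"])].append(read)
    n_marked = 0
    for group in groups.values():
        for read in group[1:]:
            read["dup"] = True
            n_marked += 1
    return n_marked

# Hypothetical RADseq locus: the enzyme cut site fixes the start position,
# so 15 genuinely independent molecules all share one coordinate.
rad_reads = [{"chrom": "chr1", "start": 5000, "strand": "+", "dup": False}
             for _ in range(15)]
print(mark_duplicates(rad_reads))  # 14: all but one real molecule get flagged
```

With randomly sheared data the same grouping mostly catches true PCR copies; with a fixed cut site it throws away nearly all of the real coverage.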

  • Mariannnna (Italy, Member)

    Hi all,

    I'm working with non-multiplexed pooled samples.
    I mean: each barcoded sample is a pool of three different animals.
    If I understood correctly, dedupping is not recommended, is it?
    The RNA-seq libraries on which I'm trying to call variants show a high percentage of duplicate reads (70-80%).

    Thank you all!!

  • Sheila (Broad Institute, Member, Broadie, admin)

    Hi Marianna,

    You are correct: you should not mark duplicates when you cannot distinguish between individuals in your sample.


  • Mariannnna (Italy, Member)


    Hi Sheila,
    even if the duplicate reads are about 80%? (This is unfortunately due in part to the low complexity of my target transcriptome.)
    I guess I'm dealing with a really complicated situation :-(

    You're really kind!

  • Sheila (Broad Institute, Member, Broadie, admin)

    Hi Marianna,

    Yes, unfortunately there is no way for the algorithm to distinguish between PCR duplicates and real duplicates, so marking duplicates is not recommended. Of course, you may want to compare your results with and without marking duplicates and see which gives better results.

