
Picard Mark Duplicates handling of Library Information

I was hoping this had been addressed already on the forum, but I've not seen a definitive answer although I have seen a similar question posed on this and other forums.

Our current duplicate-marking procedure is to merge the lane-level data generated from the same library and run Picard MarkDuplicates on that library-level merge. I believe this makes sense. Once duplicates are marked, the library-level merges are combined to create a sample-level, multi-library bam file. Any duplicates found across libraries would not be expected to be PCR duplicates but instead just identical fragments.

It's not clear, though, whether Picard MarkDuplicates is library aware, i.e. when it marks duplicates, does it only compare read pairs from the same library, or, if run against a bam merged from multiple libraries, will it mark any duplicates it finds regardless of library?

I don't see this addressed in the documentation, so I assume that is not the case, but I have seen suggestions elsewhere that it might be so.

Best Answer


  • Sheila, Broad Institute (Member, Broadie admin)


    Yes, Mark Duplicates is Read Group aware. In our pipeline, we mark duplicates twice (once at the lane level then again after merging samples across lanes).


  • l.heisler, Toronto (Member)

    Hi Sheila, thanks for your response but it doesn't clearly address the question.

    I'm trying to determine whether, when run against a merged bam file that contains multiple lanes of data from multiple libraries, MarkDuplicates uses the LB information in each read group and ONLY marks duplicates found within a given LB, as opposed to marking any duplicates found across all lanes irrespective of LB.

    We currently run a mark-duplicates step on bams merged from lanes generated from the same library. This is followed by a second merge across libraries without duplicate marking. IF MarkDuplicates is aware of the libraries and has this behaviour, then simply running it against a final merge of multiple lanes/multiple libraries would serve the same purpose.
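To make the question concrete, here is a minimal Python sketch of what "library aware" duplicate marking would mean. It is not Picard's code and does not settle what MarkDuplicates actually does; the read dictionaries, the `duplicate_key` helper and the example read group IDs are all hypothetical. The only thing taken from the SAM format is that each `@RG` header line carries an `ID` tag and, optionally, an `LB` tag.

```python
def read_group_libraries(header_lines):
    """Map each @RG ID to its LB (library) value, read from SAM header lines."""
    libraries = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # Header fields look like "ID:lane1", "LB:libX"; split on the first colon only.
        fields = dict(
            field.split(":", 1) for field in line.rstrip("\n").split("\t")[1:]
        )
        libraries[fields["ID"]] = fields.get("LB", "unknown")
    return libraries


def duplicate_key(read, libraries):
    """Hypothetical key: two reads can only be duplicates if their keys match.

    Because the library is part of the key, identical fragments from
    different libraries never compare equal.
    """
    return (libraries[read["rg"]], read["chrom"], read["pos"], read["strand"])


# Made-up header: three lanes, two libraries, one sample.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:lane1\tSM:sampleA\tLB:libX",
    "@RG\tID:lane2\tSM:sampleA\tLB:libX",
    "@RG\tID:lane3\tSM:sampleA\tLB:libY",
]
libraries = read_group_libraries(header)

same_position = {"chrom": "chr1", "pos": 1000, "strand": "+"}
read_lane1 = {"rg": "lane1", **same_position}
read_lane2 = {"rg": "lane2", **same_position}
read_lane3 = {"rg": "lane3", **same_position}
```

Under this assumed keying, `read_lane1` and `read_lane2` share a key (same library, different lanes) while `read_lane3` does not, which is the behaviour the per-library workflow described above produces by hand.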


  • jrisse, Wageningen (Member)

    Hi Geraldine,

    A related question about data from one sample merged across lanes: does the optical duplicate detection take the lane info in the read name into account, or does it just use tile and coordinate info (i.e. counting the same read at X/Y on lane 1 and X/Y on lane 2 as an optical duplicate)? It's not quite clear to me, as the read name regex for MarkDuplicates has to cover the whole read name, but the manual says read names are parsed to extract three variables: tile/region, x coordinate and y coordinate, therefore losing the lane info.


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi Judith,

    Sorry for the late response, we were very busy preparing a local workshop.

    I checked the code and it seems you're correct that lane information is not used. The code documentation for the read name regex is the following:

    Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.

    I'm not sure why we don't use lane information. My initial thought was that it's because the processing is done per-lane in the pipeline (so lane info is irrelevant there), but since we do a second per-sample round of MarkDuplicates after aggregating per-lane bams, it does seem like lane would be relevant at that point. I'll ask the devs to shed some light on this.
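The default parsing rule quoted above (split on colons, then take the last three of 5 or 7 elements) can be sketched in a few lines of Python. This is an illustration of the documented rule, not Picard's implementation; the function name `parse_tile_x_y` and the example read names are made up, and real read names with suffixes such as `#0/1` would need extra handling.

```python
def parse_tile_x_y(read_name):
    """Return (tile, x, y) following the documented default rule, or None.

    5 colon-separated elements: the 3rd, 4th and 5th are tile, x, y.
    7 colon-separated elements (CASAVA 1.8): the 5th, 6th and 7th are tile, x, y.
    """
    parts = read_name.split(":")
    if len(parts) == 5:
        tile, x, y = parts[2:5]
    elif len(parts) == 7:
        tile, x, y = parts[4:7]
    else:
        return None  # name does not fit either layout
    return int(tile), int(x), int(y)


# Hypothetical read names: same tile/x/y, different lane (the 4th element).
name_lane2 = "EAS139:136:FC706VJ:2:2104:15343:197393"
name_lane3 = "EAS139:136:FC706VJ:3:2104:15343:197393"
older_style = "HWUSI-EAS100R:6:73:941:1973"
```

Here `parse_tile_x_y(name_lane2)` and `parse_tile_x_y(name_lane3)` return the same triple, which shows why the lane is lost if only tile, x and y are kept.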

  • Sheila, Broad Institute (Member, Broadie admin)

    Hi Judith,

    MarkDuplicates is read group aware. A read group is a sample/library in a specific lane. If you look at the OpticalDuplicateFinder, the first comparison is that the two reads are in the same read group, thus, the same lane.

    So, if your read group identifies the lane, you will be fine.
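The point about read groups can be sketched as follows. This is a simplified Python illustration of the order of checks described above, not the actual OpticalDuplicateFinder code; the dictionary layout, the function name and the `max_pixel_distance` value of 100 are assumptions made for the example.

```python
def is_optical_duplicate(a, b, max_pixel_distance=100):
    """Simplified check: same read group first, then tile, then x/y proximity.

    `a` and `b` are hypothetical dicts with "read_group", "tile", "x", "y".
    """
    # Reads from different read groups (and so, typically, different lanes)
    # are never compared on coordinates.
    if a["read_group"] != b["read_group"]:
        return False
    if a["tile"] != b["tile"]:
        return False
    return (
        abs(a["x"] - b["x"]) <= max_pixel_distance
        and abs(a["y"] - b["y"]) <= max_pixel_distance
    )


# Same tile and nearly the same coordinates, once within and once across read groups.
read_a = {"read_group": "lane1", "tile": 2104, "x": 15343, "y": 197393}
read_b = {"read_group": "lane1", "tile": 2104, "x": 15350, "y": 197400}
read_c = {"read_group": "lane2", "tile": 2104, "x": 15343, "y": 197393}
```

With one read group per lane, `read_a` and `read_b` qualify while `read_a` and `read_c` do not, even though the tile and coordinates of `read_c` are identical to those of `read_a`.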

