The current GATK version is 3.7-0

# Picard Mark Duplicates handling of Library Information

Toronto · Posts: 2

I was hoping this had already been addressed on the forum, but I haven't found a definitive answer, although I have seen similar questions posed here and on other forums.

Our current duplicate-marking procedure with Picard MarkDuplicates is to merge lane-level data generated from the same library and mark duplicates on that merge. I believe this makes sense. Once duplicates are marked, the library-level merges are combined to create a sample-level, multi-library BAM file. Any duplicates found across libraries would not be expected to be PCR duplicates, just identical fragments.

It's not clear, though, whether Picard MarkDuplicates is library-aware. That is, when it marks duplicates, does it consider only read pairs from the same library, or, if run against a merged BAM generated from multiple libraries, will it mark any duplicates it finds?

I don't see this addressed in the documentation, so I assume that is not the case, but I have seen suggestions elsewhere that it might be so.


@l.heisler
Hi,

Yes, Mark Duplicates is Read Group aware. In our pipeline, we mark duplicates twice (once at the lane level then again after merging samples across lanes).

-Sheila

• Toronto · Posts: 2

Hi Sheila, thanks for your response, but it doesn't clearly address the question.

I'm trying to determine whether Mark Duplicates, when run against a merged BAM file containing multiple lanes of data from multiple libraries, is aware of the LB information in each read group and ONLY marks duplicates found within a given LB, as opposed to marking any duplicates found across all lanes, irrespective of LB.

We currently run a MarkDuplicates step on BAMs merged from lanes generated from the same library. This is followed by a second merge across libraries without duplicate marking. If MarkDuplicates is aware of the libraries and behaves this way, then simply running it against a final merge of multiple lanes and multiple libraries would serve the same purpose.
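To make the distinction concrete, here is a minimal Python sketch of the two behaviours I'm asking about. It is not Picard's code: the simplified read tuples, the read group IDs and the function names are all made up for illustration, and the duplicate key (chromosome, position, strand) is a deliberate simplification.

```python
from collections import defaultdict


def read_group_libraries(header_lines):
    """Map each read group ID to its LB value, taken from SAM @RG header lines."""
    rg_to_lb = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # Each tab-separated field after "@RG" has the form TAG:value.
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        rg_to_lb[fields["ID"]] = fields.get("LB", "unknown")
    return rg_to_lb


def duplicate_sets(reads, rg_to_lb, library_aware=True):
    """Group reads sharing a duplicate key; return only groups with more than one read.

    reads: iterable of (name, read_group, chrom, pos, strand) tuples.
    This is a hypothetical, simplified record, not a real SAM record.
    """
    groups = defaultdict(list)
    for name, rg, chrom, pos, strand in reads:
        key = (chrom, pos, strand)
        if library_aware:
            # Reads from different libraries can never land in the same group.
            key = (rg_to_lb[rg],) + key
        groups[key].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Three lanes: two from library libA, one from library libB (made-up values).
header = [
    "@RG\tID:lane1\tSM:s1\tLB:libA",
    "@RG\tID:lane2\tSM:s1\tLB:libA",
    "@RG\tID:lane3\tSM:s1\tLB:libB",
]
reads = [
    ("r1", "lane1", "chr1", 100, "+"),
    ("r2", "lane2", "chr1", 100, "+"),
    ("r3", "lane3", "chr1", 100, "+"),
]

rg_to_lb = read_group_libraries(header)
aware = duplicate_sets(reads, rg_to_lb, library_aware=True)
naive = duplicate_sets(reads, rg_to_lb, library_aware=False)
```

With `library_aware=True`, only `r1` and `r2` (both from `libA`) are grouped as duplicates; with `library_aware=False`, `r3` from `libB` is grouped with them as well. The first behaviour is the one that would make a single pass over the final multi-library merge equivalent to our current two-step procedure.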

-Larry

• Wageningen · Posts: 1

Hi Geraldine,

A related question on data from one sample merged across lanes: does optical duplicate detection take into account the lane info in the read name, or does it use only tile and coordinate info (i.e. counting a read at X/Y on lane 1 and a read at the same X/Y on lane 2 as optical duplicates)? It's not quite clear to me, as the read name regex for MarkDuplicates has to cover the whole read name, but the manual says read names are parsed to extract three variables: tile/region, x coordinate and y coordinate, therefore losing the lane info.

Thx
Judith

#### Issue · Github November 2015 by Sheila

Issue Number: 312 · State: open

Hi Judith,

Sorry for the late response, we were very busy preparing a local workshop.

I checked the code and it seems you're correct that lane information is not used. The code documentation for the read name regex is the following:

Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.
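As a rough illustration of the default behaviour described in that documentation, the colon-split rule can be sketched in Python as follows. This is not the Picard implementation: the function name and the example read names are invented, and the sketch assumes the tile, x and y fields are purely numeric.

```python
def parse_location(read_name):
    """Return (tile, x, y) using the default colon-split rule, or None if it does not apply."""
    parts = read_name.split(":")
    if len(parts) == 5:
        # 5-element names: 3rd, 4th and 5th elements are tile, x, y.
        tile, x, y = parts[2:5]
    elif len(parts) == 7:
        # 7-element names (CASAVA 1.8): 5th, 6th and 7th elements are tile, x, y.
        tile, x, y = parts[4:7]
    else:
        return None
    return int(tile), int(x), int(y)


# Made-up read names in the two layouts.
five = parse_location("MACHINE:6:73:941:1973")
seven = parse_location("INST:55:FLOWCELL:1:1101:15589:1332")
other = parse_location("read_without_colons")
```

Here `five` is `(73, 941, 1973)`, `seven` is `(1101, 15589, 1332)`, and `other` is `None`. In both layouts the element that precedes the tile is never returned.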

I'm not sure why we don't use lane information. My initial thought was that it's because the processing is done per-lane in the pipeline (so lane info is irrelevant there), but since we do a second per-sample round of MarkDuplicates after aggregating per-lane bams, it does seem like lane would be relevant at that point. I'll ask the devs to shed some light on this.
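To show why lane could matter after aggregation, here is a small Python sketch comparing a location key with and without the lane. It rests on assumptions that are not taken from the Picard code: that the lane is the 4th element of 7-element names and the 2nd element of 5-element names, and that locations are compared by exact equality (real optical duplicate detection compares positions within a distance, which this sketch ignores). The read names and function name are invented.

```python
def location_key(read_name, include_lane=False):
    """Build a location key from a colon-delimited read name.

    Assumes lane is the element directly before tile (an assumption about
    the read name layout, not something confirmed from the tool's code).
    """
    parts = read_name.split(":")
    if len(parts) == 7:
        lane, tile, x, y = parts[3:7]
    elif len(parts) == 5:
        lane, tile, x, y = parts[1:5]
    else:
        raise ValueError("unsupported read name layout: " + read_name)
    key = (int(tile), int(x), int(y))
    return (int(lane),) + key if include_lane else key


# Two made-up reads with identical tile/x/y on different lanes.
read_lane1 = "INST:1:FLOWCELL:1:1101:15589:1332"
read_lane2 = "INST:1:FLOWCELL:2:1101:15589:1332"

same_without_lane = location_key(read_lane1) == location_key(read_lane2)
same_with_lane = location_key(read_lane1, True) == location_key(read_lane2, True)
```

`same_without_lane` is `True` and `same_with_lane` is `False`: when only tile, x and y are kept, the two reads look like they sit at the same spot even though they come from different lanes.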

Geraldine Van der Auwera, PhD