MarkDuplicates on reads with different lengths

Dear GATK team,
Am I right that, since MarkDuplicates considers only the 5' coordinates of reads, it should work properly on reads (both paired-end and single-end) that have different lengths due to quality trimming from the 3' end?
Best Answers
Geraldine_VdAuwera Cambridge, MA admin
Original response (struck through): No, both ends are considered so quality trimming by hard-clipping will negate the tool's ability to identify duplicates. Soft-clipping is okay because the tool is able to use the soft-clipped sequence.
Edit: my original response was incorrect; see the discussion below.
Quality trimming is a legacy practice dating back to the time when analysis tools did not take individual base qualities into account. Now, most established tools have the ability to weigh the evidence appropriately and quality trimming is no longer useful. At this point it has more downsides than advantages, so we recommend just not doing it.
jalves Cambridge, UK ✭
@Geraldine_VdAuwera said:
No, both ends are considered so quality trimming by hard-clipping will negate the tool's ability to identify duplicates. Soft-clipping is okay because the tool is able to use the soft-clipped sequence.
Hi Geraldine. Are you sure that MarkDuplicates considers both ends for the detection of duplicates? According to the Picard Wiki page, it appears to take into account only the 5' coordinates of the reads:
"Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taken place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned"."
Source: https://sourceforge.net/p/picard/wiki/Main_Page/#q-how-does-markduplicates-work
Geraldine_VdAuwera Cambridge, MA admin
You're absolutely right, @jalves -- I'm not sure what I was thinking. Sorry about that.
@SvyatoslavSidorov my answer to you was wrong, please see this correction.
I still don't recommend hard clipping for quality though.
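For anyone who wants to see why 3' trimming leaves the duplicate key unchanged, here is a minimal Python sketch of the "where would the 5'-most base have been aligned" idea from the Picard Wiki quote. It is not Picard's code: the function name `unclipped_five_prime` and the example coordinates are made up for illustration, and treating soft and hard clips the same way is an assumption based on the quote's "all clipping" wording.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, reverse):
    """Return the 1-based reference coordinate at which the read's 5'-most
    base would sit if clipped bases had been aligned as well.

    pos     -- leftmost aligned position (SAM POS field)
    cigar   -- CIGAR string, e.g. "5S95M"
    reverse -- True if the read maps to the reverse strand
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

    if not reverse:
        # Forward strand: the 5' end is the leftmost base, so step back
        # over any leading soft (S) or hard (H) clips.
        leading = 0
        for n, op in ops:
            if op not in "SH":
                break
            leading += n
        return pos - leading

    # Reverse strand: the 5' end is the rightmost base, so take the
    # alignment end and step forward over any trailing clips.
    ref_span = sum(n for n, op in ops if op in "MDN=X")
    trailing = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trailing += n
    return pos + ref_span - 1 + trailing


# Forward read: trimming 25 bases from the 3' end shortens the alignment
# on the right, so the 5' coordinate stays put.
print(unclipped_five_prime(1000, "100M", reverse=False))  # full length
print(unclipped_five_prime(1000, "75M", reverse=False))   # 3'-trimmed

# Reverse read: the 3' end is the leftmost base, so trimming moves POS
# to the right while the 5' coordinate (alignment end) is unchanged.
print(unclipped_five_prime(1000, "100M", reverse=True))   # full length
print(unclipped_five_prime(1025, "75M", reverse=True))    # 3'-trimmed
```

In each pair of calls the full-length and the 3'-trimmed read give the same 5' coordinate, which is consistent with the correction above. For paired-end data the same reasoning would apply to each mate separately, since the quoted description uses the 5' coordinates and orientations of both reads in the pair.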
Answers
Geraldine, thank you very much!
@jalves and @Geraldine_VdAuwera, thanks for your corrections!