(BP1.1) Mapping and Marking Duplicates

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,169Administrator, GATK Dev admin

This article is part of the workflow documentation describing the Best Practices for Variant Discovery in DNAseq data. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

The Best Practices variant discovery workflow depends on having sequence data in the form of reads that are aligned to a reference genome. So the very first step is of course to map your reads to the reference to produce a file in SAM/BAM format. We recommend using BWA, but depending on your data and how it was sequenced, you may need to use a different aligner. Once you have mapped the reads, you'll need to make sure they are sorted in the proper order (by coordinate).

Then you can proceed to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called **dedupping** in bioinformatics slang) identifies these reads as such so that the GATK tools know they should ignore them.

These steps are performed with tools such as Samtools and Picard that are not part of GATK, so we don't provide detailed documentation of all the options available. For more details, please see those tools' respective documentations.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Sign In or Register to comment.