This article is part of the workflow documentation describing the Best Practices for Variant Discovery in DNAseq data. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.
The Best Practices variant discovery workflow depends on having sequence data in the form of reads that are aligned to a reference genome. So the very first step is to map your reads to the reference to produce a file in SAM/BAM format. We recommend using BWA, but depending on your data and how it was sequenced, you may need to use a different aligner. Once you have mapped the reads, you'll need to make sure they are sorted by coordinate, which is the order the downstream steps expect.
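To make "sorted by coordinate" concrete, here is a minimal Python sketch of the ordering. It is an illustration only, not how Samtools or Picard implement sorting. The `Read` record, the `coordinate_sort` helper, and the contig names are all hypothetical. The sketch assumes that contigs are ranked in the order they are listed in the file header, not alphabetically, and that reads on the same contig are ordered by leftmost mapping position.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, stripped-down stand-in for an aligned read."""
    name: str
    contig: str
    pos: int  # 1-based leftmost mapping position


def coordinate_sort(reads, contig_order):
    """Order reads by contig (in header order), then by position."""
    # Rank each contig by where it appears in the header, so that
    # "chr2" comes before "chr10" even though it sorts later as text.
    rank = {contig: i for i, contig in enumerate(contig_order)}
    return sorted(reads, key=lambda r: (rank[r.contig], r.pos))


# Assumed contig order, as it might be listed in a file header.
contig_order = ["chr1", "chr2", "chr10"]

reads = [
    Read("r1", "chr10", 500),
    Read("r2", "chr1", 1200),
    Read("r3", "chr2", 40),
    Read("r4", "chr1", 300),
]

sorted_reads = coordinate_sort(reads, contig_order)
print([r.name for r in sorted_reads])  # ['r4', 'r2', 'r3', 'r1']
```

In practice you would let a dedicated tool do the sorting on the real SAM/BAM file. The sketch only shows what the resulting order looks like.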
Then you can proceed to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called **dedupping** in bioinformatics slang) flags these reads as duplicates so that the GATK tools know they should ignore them.
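The idea behind duplicate marking can be sketched in a few lines of Python. This is a deliberately simplified toy, not the algorithm used by any particular tool. It assumes that reads sharing the same contig, 5' start position, and strand come from the same original molecule, and that the read with the highest summed base quality is kept as the representative. The `Read` record and `mark_duplicates` helper are hypothetical. The one detail taken from the SAM format itself is the `0x400` flag bit, which denotes a PCR or optical duplicate.

```python
from collections import defaultdict
from typing import NamedTuple

DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


class Read(NamedTuple):
    """Hypothetical, stripped-down stand-in for an aligned read."""
    name: str
    contig: str
    five_prime_pos: int
    is_reverse: bool
    base_quality_sum: int
    flag: int = 0


def mark_duplicates(reads):
    """Return a copy of reads with the duplicate bit set on all but
    the best-quality read in each (contig, position, strand) group."""
    groups = defaultdict(list)
    for index, read in enumerate(reads):
        key = (read.contig, read.five_prime_pos, read.is_reverse)
        groups[key].append(index)

    marked = list(reads)
    for indices in groups.values():
        # Keep the read with the highest summed base quality unflagged.
        best = max(indices, key=lambda i: reads[i].base_quality_sum)
        for i in indices:
            if i != best:
                flagged = reads[i].flag | DUPLICATE_FLAG
                marked[i] = reads[i]._replace(flag=flagged)
    return marked


reads = [
    Read("a", "chr1", 100, False, 900),
    Read("b", "chr1", 100, False, 950),  # same start and strand as "a"
    Read("c", "chr1", 100, True, 800),   # opposite strand, not a duplicate
    Read("d", "chr1", 250, False, 700),
]

marked = mark_duplicates(reads)
duplicates = [r.name for r in marked if r.flag & DUPLICATE_FLAG]
print(duplicates)  # ['a']
```

Note that the duplicates are flagged, not deleted. All reads stay in the output, and downstream tools decide what to do with the flagged ones.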
These steps are performed with tools such as Samtools and Picard that are not part of GATK, so we don't provide detailed documentation of all the options available. For more details, please see the documentation for each of those tools.
Geraldine Van der Auwera, PhD