
Does GATK4 v4.1.1.0 support MarkDuplicates?

I've run GATK3.5 with `MarkDuplicates`, but can't get it to run with GATK4 v4.1.1.0. I double-checked the best practices for data pre-processing for variant discovery and noted that the command `MarkDuplicates` still appears there. When I checked the tool documentation index I could pull up `MarkDuplicates` for GATK4 v4.0.8.0, but not v4.1.1.0. So I'm wondering if `MarkDuplicates` is supported by GATK4 v4.1.1.0?

for i in "${strings[@]}"; do
echo "${i}"

# Mark duplicates
/ast/emb/software/gatk- MarkDuplicates \
I=/ast/emb/prjt3/aligned_data/${i}Aligned.sortedByCoord.out.bam \
O=/ast/emb/prjt3/aligned_data/${i}.dedupped.bam \


USAGE: MarkDuplicates [arguments]

Identifies duplicate reads. <p>This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads
are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
construction using PCR. See also <a
href=removed link</a>
for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster,
incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication
artifacts are referred to as optical duplicates.</p><p>The MarkDuplicates tool works by comparing sequences in the 5
prime positions of both reads and read-pairs in a SAM/BAM file. An BARCODE_TAG option is available to facilitate
duplicate marking using molecular barcodes. After duplicate reads are collected, the tool differentiates the primary
and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores (default method).</p>
<p>The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for
each read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
If you are not familiar with this type of annotation, please see the following <a
href=removed link</a> for additional information.</p><p>Although the
bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of duplicate.
To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in the 'optional
field' section of a SAM/BAM file. Invoking the TAGGING_POLICY option, you can instruct the program to mark all the
duplicates (All), only the optical duplicates (OpticalOnly), or no duplicates (DontTag). The records within the output
of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked TAGGING_POLICY), as either
library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ). This tool uses the
READ_NAME_REGEX and the OPTICAL_DUPLICATE_PIXEL_DISTANCE options as the primary methods to identify and differentiate
duplicate types. Set READ_NAME_REGEX to null to skip optical duplicate detection, e.g. for RNA-seq or other data where
duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate
counts, library size estimation will be inaccurate.</p> <p>MarkDuplicates also produces a metrics file indicating the
numbers of duplicates for both single- and paired-end reads.</p> <p>The program can take either coordinate-sorted or
query-sorted inputs, however the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of
mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is
query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary reads are not excluded from the
duplication test and can be marked as duplicate reads.</p> <p>If desired, duplicates can be removed using the
REMOVE_DUPLICATE and REMOVE_SEQUENCING_DUPLICATES options.</p><h4>Usage example:</h4><pre>java -jar picard.jar
MarkDuplicates \
I=input.bam \
O=marked_duplicates.bam \

M=marked_dup_metrics.txt</pre>Please see <a
href=remved link#DuplicationMetrics'>MarkDuplicates</a> for
detailed explanations of the output metrics.<hr />

Required Arguments:

--INPUT,-I:String One or more input SAM or BAM files to analyze. Must be coordinate sorted. This argument
must be specified at least once. Required.

****************REMOVED STANDARD HELP INFO TO SHORTEN OUTPUT****************************

Invalid argument 'I=/ast/emb/prjt3/aligned_data/S1233686Aligned.sortedByCoord.out.bam'.
Tool returned:

The output suggests that `MarkDuplicates` is supported. I hope I didn't make a silly syntax error. I did double-check that my input file exists.

Best Answer

  • emb
    Accepted Answer
    After leaving this and coming back to it, I realize that I did make a silly mistake when updating my script. I called `ast/emb/software/gatk-` where I needed to call `ast/emb/software/picard-2.18.16-0/picard.jar`.

    Maybe this will help someone else. . . .


  • bhanuGandham (Cambridge MA) Member, Administrator, Broadie, Moderator


    You are absolutely right that this will help the community and thank you so much for contributing! We appreciate it. :smile:
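
For anyone landing here with the same error, the fix described in the accepted answer can be sketched roughly as follows. This is only an illustration, not the poster's actual script: the jar path and data directory are copied from the thread, while the `mark_duplicates` helper, the `DRY_RUN` switch, and the `.dup_metrics.txt` metrics file name are assumptions added for the example. The original loop was truncated, so whatever came after the `O=` line is unknown. By default the sketch only prints the command it would run.

```shell
#!/bin/sh
# Paths taken from the thread; adjust for your own installation.
PICARD_JAR=/ast/emb/software/picard-2.18.16-0/picard.jar
DATA_DIR=/ast/emb/prjt3/aligned_data

# Hypothetical helper: build the Picard MarkDuplicates command for one sample
# using the legacy KEY=value syntax from the question. The metrics file name
# (M=...) is an assumed placeholder, not something stated in the thread.
mark_duplicates() {
    sample=$1
    set -- java -jar "$PICARD_JAR" MarkDuplicates \
        "I=$DATA_DIR/${sample}Aligned.sortedByCoord.out.bam" \
        "O=$DATA_DIR/${sample}.dedupped.bam" \
        "M=$DATA_DIR/${sample}.dup_metrics.txt"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"      # dry run (default): print the command only
    else
        "$@"           # DRY_RUN=0: actually run Picard
    fi
}

# Sample name taken from the error message in the question.
for i in S1233686; do
    echo "$i"
    mark_duplicates "$i"
done
```

Set `DRY_RUN=0` to execute the command instead of printing it. Separately, the usage text in the question lists the input argument as `--INPUT,-I`, which suggests that the `gatk` wrapper expects that flag style rather than `I=`. That is an inference from the pasted output, not something confirmed in this thread.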
