Removing duplicates by running Picard's MarkDuplicates and MarkDuplicatesWithMateCigar

yuanzou1109 (Netherlands, Member)

Dear all,

I use the following pipeline for removing duplicates from a sorted.bam file:

=============MarkDuplicates ==============
java -jar /data/home/yuan/HiCD12/realign_2017_Nv2.1/picard-tools-1.141/picard.jar MarkDuplicates I=HiCD12_aln_pe.sam_sorteed_bam.bam O=marked_duplicates M=marked-dup-metrics.txt

[Fri Feb 03 17:28:57 CET 2017] picard.sam.markduplicates.MarkDuplicates INPUT=[HiCD12_aln_pe.sam_sorteed_bam.bam] OUTPUT=marked_duplicates METRICS_FILE=marked-dup-metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
INFO 2017-02-03 17:28:57 MarkDuplicates Start of doWork freeMemory: 2013987624; totalMemory: 2025848832; maxMemory: 28631367680
INFO 2017-02-03 17:28:57 MarkDuplicates Reading input file and constructing read end information.
INFO 2017-02-03 17:28:57 MarkDuplicates Will retain up to 110120644 data points before spilling to disk.
[Fri Feb 03 17:28:57 CET 2017] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2025848832
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 61, Read name SCRAT:459:C93EFANXX:6:2214:12849:35449, Mapped mate should have mate reference name
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:644)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:629)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:599)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:303)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:139)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
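
A note on the trace above: the log shows VALIDATION_STRINGENCY=STRICT, so MarkDuplicates stops at the first record that fails SAM validation (here, record 61, whose mate fields are inconsistent). Two commonly tried workarounds are sketched below as a dry run that only prints the commands. This is a sketch under assumptions, not a confirmed fix for this file: the PICARD_JAR variable and the fixed_mates.bam file name are hypothetical, and whether FixMateInformation repairs this particular record is untested.

```shell
# Dry-run sketch: nothing here runs Java; the commands are only printed.
# PICARD_JAR and fixed_mates.bam are assumed names for illustration.
PICARD_JAR="${PICARD_JAR:-picard.jar}"
IN_BAM="HiCD12_aln_pe.sam_sorteed_bam.bam"

# Option A: rewrite the mate fields first, then run MarkDuplicates on the result.
fix_cmd="java -jar $PICARD_JAR FixMateInformation I=$IN_BAM O=fixed_mates.bam"

# Option B: keep the input as is, but report validation problems as warnings
# instead of aborting on the first one.
lenient_cmd="java -jar $PICARD_JAR MarkDuplicates I=$IN_BAM O=marked_duplicates M=marked-dup-metrics.txt VALIDATION_STRINGENCY=LENIENT"

echo "$fix_cmd"
echo "$lenient_cmd"
```

Option B hides the symptom rather than repairing the file, so Option A is usually the safer first attempt when later tools will also read the BAM.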

=========================MarkDuplicatesWithMateCigar ===========

java -jar /data/home/yuan/HiCD12/realign_2017_Nv2.1/picard-tools-1.141/picard.jar MarkDuplicatesWithMateCigar
INPUT=HiCD12_aln_pe.sam_sorteed_bam.bam OUTPUT=mark_dup_cig.bam METRICS_FILE=mark_dup_cig_metrics.txt

But I got this error:

ERROR: Option 'OUTPUT' is required.

USAGE: MarkDuplicatesWithMateCigar [options]

Documentation: http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar

Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged.
Version: 1.141(8ece590411350163e7689e9e77aab8efcb622170_1447695087)

Options:

--help
-h Displays options specific to this tool.

--stdhelp
-H Displays options specific to this tool AND options common to all Picard command line
tools.

--version Displays program version.

MINIMUM_DISTANCE=Integer The minimum distance to buffer records to account for clipping on the 5' end of the
records.Set this number to -1 to use twice the first read's read length (or 100,
whichever is smaller). Default value: -1. This option can be set to 'null' to clear the
default value.

SKIP_PAIRS_WITH_NO_MATE_CIGAR=Boolean
Skip record pairs with no mate cigar and include them in the output. Default value:
true. This option can be set to 'null' to clear the default value. Possible values:
{true, false}

BLOCK_SIZE=Integer The block size for use in the coordinate-sorted record buffer. Default value: 100000.
This option can be set to 'null' to clear the default value.

INPUT=String
I=String One or more input SAM or BAM files to analyze. Must be coordinate sorted. Default value:
null. This option may be specified 0 or more times.

OUTPUT=File
O=File The output file to write marked records to Required.

METRICS_FILE=File
M=File File to write duplication metrics to Required.

PROGRAM_RECORD_ID=String
PG=String The program record ID for the @PG record(s) created by this program. Set to null to
disable PG record creation. This string may have a suffix appended to avoid collision
with other program record IDs. Default value: MarkDuplicates. This option can be set to
'null' to clear the default value.

PROGRAM_GROUP_VERSION=String
PG_VERSION=String Value of VN tag of PG record to be created. If not specified, the version will be
detected automatically. Default value: null.

PROGRAM_GROUP_COMMAND_LINE=String
PG_COMMAND=String Value of CL tag of PG record to be created. If not supplied the command line will be
detected automatically. Default value: null.

PROGRAM_GROUP_NAME=String
PG_NAME=String Value of PN tag of PG record to be created. Default value: MarkDuplicatesWithMateCigar.
This option can be set to 'null' to clear the default value.

COMMENT=String
CO=String Comment(s) to include in the output file's header. Default value: null. This option may
be specified 0 or more times.

REMOVE_DUPLICATES=Boolean If true do not write duplicates to the output file instead of writing them with
appropriate flags set. Default value: false. This option can be set to 'null' to clear
the default value. Possible values: {true, false}

ASSUME_SORTED=Boolean
AS=Boolean If true, assume that the input file is coordinate sorted even if the header says
otherwise. Default value: false. This option can be set to 'null' to clear the default
value. Possible values: {true, false}

DUPLICATE_SCORING_STRATEGY=ScoringStrategy
DS=ScoringStrategy The scoring strategy for choosing the non-duplicate among candidates. Default value:
TOTAL_MAPPED_REFERENCE_LENGTH. This option can be set to 'null' to clear the default
value. Possible values: {SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH}

READ_NAME_REGEX=String Regular expression that can be used to parse read names in the incoming SAM file. Read
names are parsed to extract three variables: tile/region, x coordinate and y coordinate.
These values are used to estimate the rate of optical duplication in order to give a more
accurate estimated library size. Set this option to null to disable optical duplicate
detection. The regular expression should contain three capture groups for the three
variables, in order. It must match the entire read name. Note that if the default regex
is specified, a regex match is not actually done, but instead the read name is split on
colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be
tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements
are assumed to be tile, x and y values. Default value:
[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to
clear the default value.

OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer
The maximum offset between two duplicte clusters in order to consider them optical
duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels)
unless using later versions of the Illumina pipeline that multiply pixel values by 10, in
which case 50-100 is more normal. Default value: 100. This option can be set to 'null

====================

My data is from a single sample and only one library, so I did not add a read group to my BAM file.

In MarkDuplicatesWithMateCigar, I did provide the OUTPUT option, but the error says "Option 'OUTPUT' is required". Why?

Can anyone help me fix this problem? Thank you in advance ;)
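
One possible explanation for the "Option 'OUTPUT' is required" message, assuming the command was entered on two lines exactly as pasted above: without a trailing backslash, the shell runs the first line by itself, so Picard starts with no INPUT, OUTPUT, or METRICS_FILE arguments at all. The sketch below keeps everything in one command by using line continuations. It is a dry run that only prints the resulting command; PICARD_JAR is a hypothetical variable, and the file names are the ones from the post.

```shell
# Dry-run sketch: the command is assembled and printed, not executed.
# PICARD_JAR is an assumed variable pointing at picard.jar.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# "set --" stores the words as positional parameters so they can be inspected.
# Each trailing backslash joins the next line to the same command.
set -- java -jar "$PICARD_JAR" MarkDuplicatesWithMateCigar \
  INPUT=HiCD12_aln_pe.sam_sorteed_bam.bam \
  OUTPUT=mark_dup_cig.bam \
  METRICS_FILE=mark_dup_cig_metrics.txt

# Print the command as the shell sees it: a single line with all options.
echo "$@"

# To actually run it, replace the echo above with:  "$@"
```

If the printed line contains all three options, the same text can be pasted into the terminal as a single command.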
