Number of reads with PCR duplicates varies when checked manually and in output log file

This is the SAM file produced after base recalibration, "Drr.sam". I wanted to count the lines marked as PCR duplicates, so I first counted the total number of lines in the SAM file and then counted the lines carrying the MarkDuplicates tag.

wc -l Drr.sam
200809 Drr.sam

grep "PG:Z:MarkDuplicates" Drr.sam | wc -l
200809

Does that mean every line is marked as a duplicate when running MarkDuplicates?

Also, this is the log file generated by GATK:

INFO 2019-11-21 11:30:53 MarkDuplicates

********** NOTE: Picard's command line syntax is changing.


********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


********** The command line looks like this in the new syntax:


********** MarkDuplicates -INPUT sortsam/Drr_aligned_sorted.bam -OUTPUT dupmarked/Drr_aligned_sorted_dupmarked.bam -VALIDATION_STRINGENCY LENIENT -CREATE_INDEX true -METRICS_FILE dupmarked/Drr_Output_Duplicate_metrics


11:30:53.701 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:softwares/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Nov 21 11:30:53 IST 2019] MarkDuplicates INPUT=[sortsam/Drr_aligned_sorted.bam] OUTPUT=dupmarked/Drr_aligned_sorted_dupmarked.bam METRICS_FILE=dupmarked/Drr_Output_Duplicate_metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX= OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Nov 21 11:30:53 IST 2019] Executing as [email protected] on Linux 4.15.0-62-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.7-SNAPSHOT
INFO 2019-11-21 11:30:53 MarkDuplicates Start of doWork freeMemory: 430168760; totalMemory: 442499072; maxMemory: 6541541376
INFO 2019-11-21 11:30:53 MarkDuplicates Reading input file and constructing read end information.
INFO 2019-11-21 11:30:53 MarkDuplicates Will retain up to 23701236 data points before spilling to disk.
WARNING 2019-11-21 11:30:53 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: Drr.100611. Cause: String 'Drr.100611' did not start with a parsable number.
INFO 2019-11-21 11:30:56 MarkDuplicates Read 195111 records. 0 pairs never matched.
INFO 2019-11-21 11:30:57 MarkDuplicates After buildSortedReadEndLists freeMemory: 518177576; totalMemory: 737148928; maxMemory: 6541541376
INFO 2019-11-21 11:30:57 MarkDuplicates Will retain up to 204423168 duplicate indices before spilling to disk.
INFO 2019-11-21 11:30:57 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2019-11-21 11:30:58 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2019-11-21 11:30:58 MarkDuplicates Sorting list of duplicate records.
INFO 2019-11-21 11:30:58 MarkDuplicates After generateDuplicateIndexes freeMemory: 731980800; totalMemory: 2376073216; maxMemory: 6541541376
INFO 2019-11-21 11:30:58 MarkDuplicates Marking 85071 records as duplicates.
INFO 2019-11-21 11:30:58 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2019-11-21 11:30:58 MarkDuplicates Reads are assumed to be ordered by: coordinate
INFO 2019-11-21 11:31:03 MarkDuplicates Writing complete. Closing input iterator.
INFO 2019-11-21 11:31:03 MarkDuplicates Duplicate Index cleanup.
INFO 2019-11-21 11:31:03 MarkDuplicates Getting Memory Stats.
INFO 2019-11-21 11:31:03 MarkDuplicates Before output close freeMemory: 2399045704; totalMemory: 2409627648; maxMemory: 6541541376
INFO 2019-11-21 11:31:03 MarkDuplicates Closed outputs. Getting more Memory Stats.
INFO 2019-11-21 11:31:03 MarkDuplicates After output close freeMemory: 2421734000; totalMemory: 2431647744; maxMemory: 6541541376
[Thu Nov 21 11:31:03 IST 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.16 minutes.
Runtime.totalMemory()=2431647744

Best Answer

  • SkyWarrior, Turkey
    Accepted Answer

    You can't count duplicates that way. The PG tag only records that a read was processed by a particular program; it says nothing about whether the read is a duplicate.

    If you wish to count the duplicate-marked reads, use the read flag 0x400.

    samtools view -c -f 0x400 bamfile.bam

    This gives you the number of duplicate-marked reads.
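
To make the difference between the two counts concrete, here is a minimal sketch that does not need samtools. It assumes a tiny made-up SAM file (the reads r1, r2, r3 and the temporary file are hypothetical, not taken from Drr.sam) in which every record carries the PG:Z:MarkDuplicates tag but only two have the 0x400 bit set in the FLAG column. The awk expression is just one way to test that bit by hand; on real data the samtools command above is the simpler choice.

```shell
# Build a tiny, made-up SAM file (hypothetical reads r1..r3) for illustration only.
sam=$(mktemp)
printf '@HD\tVN:1.6\tSO:coordinate\n' > "$sam"
printf '@PG\tID:MarkDuplicates\tPN:MarkDuplicates\n' >> "$sam"
printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tPG:Z:MarkDuplicates\n' >> "$sam"
printf 'r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tPG:Z:MarkDuplicates\n' >> "$sam"
printf 'r3\t1040\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tPG:Z:MarkDuplicates\n' >> "$sam"

# Every record carries the PG tag, so this counts all reads, not duplicates.
pg_count=$(grep -c 'PG:Z:MarkDuplicates' "$sam")

# Skip header lines (starting with @) and test bit 0x400 (decimal 1024)
# of the FLAG column, which is the second tab-separated field.
dup_count=$(awk -F'\t' '!/^@/ && int($2 / 1024) % 2 == 1 { n++ } END { print n + 0 }' "$sam")

echo "reads with PG tag: $pg_count"
echo "duplicate-flagged reads: $dup_count"
rm -f "$sam"
```

In this toy file the grep count covers all three reads, while the flag test picks out only r2 and r3 (FLAG 1024 and 1040), which is the same pattern as in the question: the tag count equals the total line count, and only the flag reflects duplicate status.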
