Error in ValidateSamFile when multiplexing MarkDuplicates

olavur Member
edited May 2017 in Ask the GATK team

I'm using GATK 3.7 and Picard v2.9.2. When I pass multiple input BAMs to MarkDuplicates (my data is multiplexed across lanes), the tool finishes without complaint, but ValidateSamFile then reports an error on the resulting BAM. Both the MarkDuplicates and ValidateSamFile commands and their output are included below.

Note that I am temporarily using OpenJDK 1.8. If there is any chance that this is causing the error, I'll just have to wait until I can try it with Oracle Java.

I used the methods described in Tutorial#6483 to map and clean up the reads.
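
For reference, the per-lane preprocessing that produced the sorted BAMs looked roughly like this (a sketch only, not the tutorial's exact pipeline; $REF and the read-group values are placeholders):

# Align one lane with bwa mem and coordinate-sort with Picard SortSam.
# $REF and the @RG fields are illustrative placeholders.
bwa mem -M -R '@RG\tID:H2CKVAFXX.1\tSM:318616_S1\tLB:lib1\tPL:ILLUMINA' \
    $REF 318616_S1_L001_R1.fastq.gz 318616_S1_L001_R2.fastq.gz \
    > 318616_S1_L001.sam

java -jar $PICARD SortSam \
    INPUT=318616_S1_L001.sam \
    OUTPUT=318616_S1_L001_sorted.bam \
    SORT_ORDER=coordinate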

The MarkDuplicates command:

java -jar $PICARD MarkDuplicates \
    INPUT=318616_S1_L001_sorted.bam \
    INPUT=318616_S1_L002_sorted.bam \
    OUTPUT=318616_S1_dedup.bam \
    METRICS_FILE=318616_S1_dedup_metrics.txt

Gives the output:

[Tue May 16 12:46:32 WEST 2017] picard.sam.markduplicates.MarkDuplicates INPUT=[318616_S1_L001_sorted.bam, 318616_S1_L002_sorted.bam] OUTPUT=318616_S1_dedup.bam METRICS_FILE=318616_S1_dedup_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue May 16 12:46:32 WEST 2017] Executing as olavur@hnpv-fargenCompute01 on Linux 4.4.0-72-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13; Picard version: 2.9.2-SNAPSHOT
INFO    2017-05-16 12:46:32     MarkDuplicates  Start of doWork freeMemory: 247002616; totalMemory: 253231104; maxMemory: 3736076288
INFO    2017-05-16 12:46:32     MarkDuplicates  Reading input file and constructing read end information.
INFO    2017-05-16 12:46:32     MarkDuplicates  Will retain up to 13536508 data points before spilling to disk.
INFO    2017-05-16 12:46:40     MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:07s.  Time for last 1,000,000:    7s.  Last read position: 3:46,939,289
INFO    2017-05-16 12:46:40     MarkDuplicates  Tracking 34300 as yet unmatched pairs. 1970 records in RAM.
INFO    2017-05-16 12:46:46     MarkDuplicates  Read     2,000,000 records.  Elapsed time: 00:00:13s.  Time for last 1,000,000:    6s.  Last read position: 6:167,786,684
INFO    2017-05-16 12:46:46     MarkDuplicates  Tracking 52092 as yet unmatched pairs. 130 records in RAM.
INFO    2017-05-16 12:46:52     MarkDuplicates  Read     3,000,000 records.  Elapsed time: 00:00:19s.  Time for last 1,000,000:    6s.  Last read position: 11:55,321,871
INFO    2017-05-16 12:46:52     MarkDuplicates  Tracking 53094 as yet unmatched pairs. 3924 records in RAM.
INFO    2017-05-16 12:46:57     MarkDuplicates  Read     4,000,000 records.  Elapsed time: 00:00:25s.  Time for last 1,000,000:    5s.  Last read position: 16:22,358,872
INFO    2017-05-16 12:46:57     MarkDuplicates  Tracking 39568 as yet unmatched pairs. 4046 records in RAM.
INFO    2017-05-16 12:47:04     MarkDuplicates  Read     5,000,000 records.  Elapsed time: 00:00:31s.  Time for last 1,000,000:    6s.  Last read position: 22:50,518,158
INFO    2017-05-16 12:47:04     MarkDuplicates  Tracking 14634 as yet unmatched pairs. 142 records in RAM.
INFO    2017-05-16 12:47:05     MarkDuplicates  Read 5205808 records. 0 pairs never matched.
INFO    2017-05-16 12:47:06     MarkDuplicates  After buildSortedReadEndLists freeMemory: 1438835464; totalMemory: 2132279296; maxMemory: 3736076288
INFO    2017-05-16 12:47:06     MarkDuplicates  Will retain up to 116752384 duplicate indices before spilling to disk.
INFO    2017-05-16 12:47:06     MarkDuplicates  Traversing read pair information and detecting duplicates.
INFO    2017-05-16 12:47:07     MarkDuplicates  Traversing fragment information and detecting duplicates.
INFO    2017-05-16 12:47:07     MarkDuplicates  Sorting list of duplicate records.
INFO    2017-05-16 12:47:08     MarkDuplicates  After generateDuplicateIndexes freeMemory: 2103791192; totalMemory: 3064463360; maxMemory: 3736076288
INFO    2017-05-16 12:47:08     MarkDuplicates  Marking 2637489 records as duplicates.
INFO    2017-05-16 12:47:08     MarkDuplicates  Found 13624 optical duplicate clusters.
INFO    2017-05-16 12:47:08     MarkDuplicates  Reads are assumed to be ordered by: coordinate
INFO    2017-05-16 12:48:24     MarkDuplicates  Before output close freeMemory: 3037617104; totalMemory: 3065511936; maxMemory: 3736076288
INFO    2017-05-16 12:48:24     MarkDuplicates  After output close freeMemory: 2980877992; totalMemory: 3008364544; maxMemory: 3736076288
[Tue May 16 12:48:24 WEST 2017] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 1.87 minutes.
Runtime.totalMemory()=3008364544

And the ValidateSamFile command:

$ java -jar $PICARD ValidateSamFile I=318616_S1_dedup.bam MODE=SUMMARY
[Tue May 16 13:17:11 WEST 2017] picard.sam.ValidateSamFile INPUT=318616_S1_dedup.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue May 16 13:17:11 WEST 2017] Executing as olavur@hnpv-fargenCompute01 on Linux 4.4.0-72-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13; Picard version: 2.9.2-SNAPSHOT
[Tue May 16 13:17:16 WEST 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.08 minutes.
Runtime.totalMemory()=1243611136
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.  1: NS500347:4:H2CKVAFXX:1:21304:16813:12821
        at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
        at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
        at htsjdk.samtools.SamFileValidator$CoordinateSortedPairEndInfoMap.remove(SamFileValidator.java:765)
        at htsjdk.samtools.SamFileValidator.validateMateFields(SamFileValidator.java:499)
        at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:297)
        at htsjdk.samtools.SamFileValidator.validateSamFile(SamFileValidator.java:215)
        at htsjdk.samtools.SamFileValidator.validateSamFileSummary(SamFileValidator.java:143)
        at picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:196)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
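
If I read the exception correctly, ValidateSamFile loaded the same read name into its coordinate-sorted pair map more than once, i.e. it saw more than one complete pair named NS500347:4:H2CKVAFXX:1:21304:16813:12821. A quick check is to count the records carrying that name in the output BAM (a sketch, assuming samtools is available; a paired-end file with only primary alignments should print 2, one record per mate):

# Count records whose read name (SAM field 1) matches the name
# from the exception; more than 2 suggests the pair exists twice.
samtools view 318616_S1_dedup.bam \
    | awk -F'\t' '$1 == "NS500347:4:H2CKVAFXX:1:21304:16813:12821"' \
    | wc -l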

Best Answer

  • olavur Member
    Accepted Answer

    The problem solved itself; it turned out to be human error (i.e., me).

    In case it is of any interest, I think the problem was that instead of passing lanes 1 and 2 to MarkDuplicates, I passed the lane 1 BAM twice, so the same read pairs appeared twice in the input and confused Picard.
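
    A cheap guard against that mistake is to verify the inputs are distinct files before running MarkDuplicates (a sketch; identical checksums mean the same lane was passed twice):

        # Both lanes should produce different checksums.
        md5sum 318616_S1_L001_sorted.bam 318616_S1_L002_sorted.bam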
