

# I don't understand the error message of Picard MarkDuplicates

TiHo-Hannover, Germany · Member · Posts: 5

Hello, I have started working with NextSeq 500 data for variant calling. At the moment every job terminates with the following error message:
```
INFO 2016-06-24 12:18:47 MarkDuplicates Tracking 1032268 as yet unmatched pairs. 6094 records in RAM.
INFO 2016-06-24 12:18:52 MarkDuplicates Read 63,000,000 records. Elapsed time: 00:14:18s. Time for last 1,000,000: 5s. Last read position:
INFO 2016-06-24 12:18:52 MarkDuplicates Tracking 1040251 as yet unmatched pairs. 2676 records in RAM.
[Fri Jun 24 12:19:17 CEST 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 14.90 minutes.
Runtime.totalMemory()=46430945280
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 10:
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:193)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
```

When I repeated the job, Picard crashed earlier, already after 40,000,000 records. There is enough disk space, and the jobs were running on a Linux machine with 40 cores and 512 GB RAM. Whenever I searched for this error message on the web, I always found a read name after "Value was put into PairInfoMap more than once. 10:", but here the read name is missing. What does "10:" mean? I used the following software: bwa-0.7.13, samtools-1.3.1, and several Picard versions from 1.139 to 2.4.1. Could this be a data problem or an environmental problem?
Best regards
Joern


This suggests that you might have the same read name repeated more than once in your file. Did you perhaps merge data from multiple runs?

Try running ValidateSamFile on your file in summary mode.
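A minimal invocation would look something like this (file and jar names are placeholders; `MODE=SUMMARY` aggregates errors by type instead of listing every offending record):

```shell
# Sketch: validate a BAM in summary mode. "input.bam" and the path to
# picard.jar are placeholder names, not the poster's actual files.
java -jar picard.jar ValidateSamFile \
    I=input.bam \
    MODE=SUMMARY
```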

Geraldine Van der Auwera, PhD

TiHo-Hannover, Germany · Member · Posts: 5

Hello, sorry for my late answer. No, I did not merge multiple runs. I have this problem for all my NextSeq samples. Normally we index 3 samples per NextSeq run; after bcl2fastq I get 3 paired-end FASTQs, which I filter with prinseq, align with bwa mem, convert from SAM to BAM with samtools, sort with samtools, and then run through Picard MarkDuplicates. All 3 SAM/BAM files were checked by ValidateSamFile without errors. All MarkDuplicates jobs fail; if I restart a job, it may crash after a different number of records. I have tried several versions of these 3 programs. I must also say that about 100 jobs on old samples (MiSeq and HiSeq runs) ran perfectly. If there is a repeated read name, why is that read name not written in the error message?
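For reference, the alignment part of the pipeline described above can be sketched roughly as follows (file names, thread count, and read-group string are placeholders, not the poster's actual values; the prinseq filtering step is omitted):

```shell
# Rough sketch of the described pipeline: bwa mem -> BAM -> sort -> MarkDuplicates.
# All file names below are placeholders for illustration.
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    ref.fa sample1_R1.fastq.gz sample1_R2.fastq.gz |
  samtools view -b - |
  samtools sort -o sample1.sorted.bam -
samtools index sample1.sorted.bam
java -jar picard.jar MarkDuplicates \
    I=sample1.sorted.bam \
    O=sample1.dedup.bam \
    M=sample1.dup_metrics.txt
```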

#### Issue · GitHub · June 2016 by Sheila

Issue Number: 1035 · State: closed · Closed by: vdauwera

@nexejwre -- Let me add to this conversation. Two oddities are:

1. `MarkDuplicates Tracking 1032268 as yet unmatched pairs`. I will hazard this is a count of reads (either singly mapping or multimapping) whose mates, as defined by the tool, it cannot find, whether those mates are elsewhere in the file or missing from it entirely. At this point roughly 1 out of every 60 of your reads is in this state.
2. `Value was put into PairInfoMap more than once. 10:` Without going into detail, this implies to me that the tool cannot figure out how to pair the reads given the state of the SAM flags within the BAM. For example, for a multimapping set, when it encounters a third alignment with the same query name, it throws the exception because it doesn't know which pair to consider the proper pair on which to base duplicate insert status. MarkDuplicates flags duplicate inserts, so it needs to know which of the multiple alignments in a multimapping set to base the proper insert on. As for deciphering the `10:`, the code says to return `sequenceIndex + ": " + keyAndRecord.getKey()`, so 10 is the `sequenceIndex`.

Can you check your multimapping read sets (sets that have secondary/supplementary alignments) to see whether the proper-pair flag (0x2 bit) is set for one pair of the reads and the remaining reads have the mate-unmapped flag (0x8 bit) set? You can pull out sets using the instructions at the end of this blog.
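As a rough illustration of that bit test, the snippet below decodes the two flags with awk. The three records are invented for illustration (`readA` is a hypothetical multimapping query name); with real data you would stream `samtools view your.bam` into the same awk program instead of the `printf`:

```shell
# Sketch: decode the proper-pair (0x2) and mate-unmapped (0x8) bits of the
# SAM FLAG field (column 2). Records here are invented, not real data.
check_flags() {
  awk '{
    proper   = int($2 / 2) % 2   # 0x2: read mapped in proper pair
    mate_unm = int($2 / 8) % 2   # 0x8: mate unmapped
    printf "%s flag=%d proper_pair=%d mate_unmapped=%d\n", $1, $2, proper, mate_unm
  }'
}

# Hypothetical set: a proper pair (flags 99/147) plus one secondary
# alignment (flag 329 = paired + mate unmapped + first in pair + secondary).
printf 'readA\t99\nreadA\t147\nreadA\t329\n' | check_flags
```

In a well-formed multimapping set you would expect exactly the pattern shown: the primary pair has the 0x2 bit set, and the extra alignments carry the 0x8 bit instead.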

If these flags are not set for your multimapping sets, then a solution is to use MergeBamAlignment (link goes to step 3C). Among other things, MergeBamAlignment will add back the unmapped mates that BWA drops and set the proper-pair flag for one pair in a multimapping set. It also dissociates the remaining reads in the multimapping set from the proper pair by setting their mate-unmapped flag.
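A MergeBamAlignment run would look roughly like this (file names are placeholders; the unmapped BAM is the queryname-sorted uBAM the reads originally came from, e.g. one made with FastqToSam, and the reference must be the same one used for alignment):

```shell
# Sketch of MergeBamAlignment; all file names below are placeholders.
# This reconciles the aligner's output against the original unmapped BAM,
# restoring dropped mates and fixing pairing flags along the way.
java -jar picard.jar MergeBamAlignment \
    ALIGNED_BAM=aligned.bam \
    UNMAPPED_BAM=unmapped.bam \
    OUTPUT=merged.bam \
    REFERENCE_SEQUENCE=ref.fa
```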

Finally, I'd like to point out that MarkDuplicates now accepts query-group sort order. The output of BWA-MEM is in this order so long as you did your alignments on a query-sorted unaligned BAM. Using MarkDuplicates with this type of input adds additional flagging for the supplementary/secondary and unmapped reads of set for which the proper pair is flagged duplicate. It may be that you would encounter your error faster for this type of input.