
I don't understand the error message of Picard MarkDuplicates

nexejwre · TiHo-Hannover, Germany · Member

Hello, I have started working with NextSeq 500 data for variant calling. At the moment every job terminates with the following error message:
INFO 2016-06-24 12:18:47 MarkDuplicates Tracking 1032268 as yet unmatched pairs. 6094 records in RAM.
INFO 2016-06-24 12:18:52 MarkDuplicates Read 63,000,000 records. Elapsed time: 00:14:18s. Time for last 1,000,000: 5s. Last read position:
INFO 2016-06-24 12:18:52 MarkDuplicates Tracking 1040251 as yet unmatched pairs. 2676 records in RAM.
[Fri Jun 24 12:19:17 CEST 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 14.90 minutes.
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 10:
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:442)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:193)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

When I repeated the job, Picard crashed after only 40,000,000 records. There is enough disk space, and the jobs were running on a Linux machine with 40 cores and 512 GB of RAM. When I searched the web for this error message, I always found a read name after "Value was put into PairInfoMap more than once. 10:"; here the read name is missing. What does "10:" mean? I used the following software: bwa-0.7.13, samtools-1.3.1, and several Picard versions (1.139-2.4.1). Could this be a data problem or an environmental problem?
Best regards

Best Answer


  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
    This suggests that you might have the same read name repeated more than once in your file. Did you perhaps merge data from multiple runs?

    Try running ValidateSamFile on your file in summary mode.
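
    Beyond ValidateSamFile, a quick way to check Geraldine's hypothesis is to count primary records per query name: in a paired-end BAM, each name should appear on at most two primary alignments. A minimal sketch in plain Python over SAM text (the field layout and flag bits come from the SAM specification; the example records are made up):

    ```python
    from collections import Counter

    def count_repeated_names(sam_lines):
        """Return query names with more than two primary alignment records.

        Secondary (0x100) and supplementary (0x800) records are skipped,
        since a multimapping set legitimately reuses the query name.
        """
        counts = Counter()
        for line in sam_lines:
            if line.startswith("@"):          # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            name, flag = fields[0], int(fields[1])
            if flag & 0x100 or flag & 0x800:  # not a primary record
                continue
            counts[name] += 1
        return {n: c for n, c in counts.items() if c > 2}

    # Illustrative records only: readA appears on three primary lines,
    # which is what merged duplicate runs would look like.
    sam = [
        "@HD\tVN:1.6",
        "readA\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
        "readA\t147\tchr1\t200\t60\t50M\t=\t100\t-150\t*\t*",
        "readA\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
    ]
    print(count_repeated_names(sam))  # {'readA': 3}
    ```

    For a real BAM you would feed this the output of a samtools view pipe instead of a hand-written list.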
  • nexejwre · TiHo-Hannover, Germany · Member

    Hello, sorry for my late answer. No, I did not merge multiple runs. I have this problem with all my NextSeq samples. Normally we index 3 samples per NextSeq run; after bcl2fastq I get 3 paired-end FASTQs, which I filter with prinseq, align with bwa mem, convert from SAM to BAM with samtools, sort with samtools, and then want to run Picard MarkDuplicates. All 3 SAM/BAM files were checked by ValidateSamFile without errors. All MarkDuplicates jobs fail, and if I restart a job it can crash after a different number of records. I tried several versions of these 3 programs. But I must also say that about 100 jobs on old samples (MiSeq and HiSeq runs) ran perfectly. If there is a repeated read name, why is that read name not written in the error message?

    Issue · Github
    by Sheila

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭

    @nexejwre -- Let me add to this conversation. Two oddities are:

    1. MarkDuplicates Tracking 1032268 as yet unmatched pairs. I will hazard that this is a count of reads (either singly mapping or multimapping) whose mates (as defined by the tool) the tool cannot find, whether they are still ahead in the file or missing from it entirely. At this point roughly 1 in 60 of your reads is in this state.
    2. Value was put into PairInfoMap more than once. 10: Without going into detail, this implies to me that the tool cannot figure out how to pair the reads given the state of the SAM flags within the BAM. For example, for a multimapping set, when it encounters a third alignment with the same query name, it throws the exception because it doesn't know which pair to consider the proper pair on which to decide duplicate insert status. MarkDuplicates flags duplicate inserts, so it needs to know which of the multiple alignments of a multimapping set to base the proper insert on. As for deciphering the 10:, the code says to return sequenceIndex + ": " + keyAndRecord.getKey(), so 10 is the sequenceIndex.
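
    That key format also explains why no read name appears in your message: a bare "10:" is a sequence index with an empty name key. A hedged sketch mimicking the formatting described above (this is not the actual htsjdk Java code):

    ```python
    def decode_pair_info_key(message_suffix: str):
        """Split the 'sequenceIndex: readName' suffix of the exception.

        htsjdk builds the key as sequenceIndex + ": " + readName, so a
        bare "10:" decodes to sequence index 10 with an empty read name.
        """
        index_str, _, read_name = message_suffix.partition(":")
        return int(index_str), read_name.strip()

    print(decode_pair_info_key("10:"))            # (10, '')
    print(decode_pair_info_key("3: HWI-read42"))  # (3, 'HWI-read42')
    ```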

    Can you check your multimapping read sets (sets that have secondary/supplementary alignments) to see if the proper pair flag (0x2 bit) is set for one pair of the reads and that the remaining reads have the mate unmapped flag (0x8 bit)? You can pull out sets using instructions at the end of this blog.
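
    Checking those bits by hand is straightforward. A small sketch (the flag constants are from the SAM specification; the example flag values are illustrative):

    ```python
    # SAM flag bits, per the SAM specification
    PROPER_PAIR   = 0x2    # read mapped in proper pair
    MATE_UNMAPPED = 0x8    # mate is unmapped
    SECONDARY     = 0x100  # secondary alignment
    SUPPLEMENTARY = 0x800  # supplementary alignment

    def describe_flags(flag: int) -> dict:
        """Report the four flag bits discussed above for one SAM record."""
        return {
            "proper_pair":   bool(flag & PROPER_PAIR),
            "mate_unmapped": bool(flag & MATE_UNMAPPED),
            "secondary":     bool(flag & SECONDARY),
            "supplementary": bool(flag & SUPPLEMENTARY),
        }

    # 99 (0x63): paired, proper pair, mate on reverse strand, first in pair
    print(describe_flags(99))
    # 329 (0x149): paired, mate unmapped, first in pair, secondary
    print(describe_flags(329))
    ```

    The flag value is the second tab-separated column of a SAM record, so you can run records from a multimapping set through this after pulling them out with samtools.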

    If these flags are not set for your multimapping sets, then a solution is to use MergeBamAlignment (link goes to step 3C). Among other things, MergeBamAlignment will add back unmapped mates that BWA drops and sets the proper pair flag for one pair in a multimapping set. It also dissociates the remaining reads in the multimapping set from the proper pair by setting their mate unmapped flag.

    Finally, I'd like to point out that MarkDuplicates now accepts query-group sort order. The output of BWA-MEM is in this order as long as you did your alignments on a query-sorted unaligned BAM. With this type of input, MarkDuplicates additionally flags the supplementary/secondary and unmapped reads of a set whose proper pair is flagged as a duplicate. It may be that you would encounter your error faster with this type of input.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
    Thanks for letting us know, Joern.