
I got errors about "Value was put into PairInfoMap more than once" and "READ_NAME_REGEX"

I am trying to call SNPs from my multi-sample RNA-seq data.
The RNA-seq data I have comes from paired-end sequencing and looks like this:
sampleA.r1.fastq.gz , sampleA.r2.fastq.gz
sampleB.r1.fastq.gz , sampleB.r2.fastq.gz
sampleC.r1.fastq.gz , sampleC.r2.fastq.gz

Then I had an idea for calling SNPs:

1) Use the Linux command "cat" to combine sampleA.r1.fastq.gz, sampleB.r1.fastq.gz, and sampleC.r1.fastq.gz into one file named "R1.fastq.gz".
2) Use the Linux command "cat" to combine sampleA.r2.fastq.gz, sampleB.r2.fastq.gz, and sampleC.r2.fastq.gz into one file named "R2.fastq.gz".
3) Then use R1.fastq.gz and R2.fastq.gz to call SNPs with this workflow.
Here are my steps:
a) Use HISAT2 to align R1.fastq.gz and R2.fastq.gz to the reference sequence, which gives me a SAM file.
b) Use samtools to convert the SAM file into a BAM file.
c) Use AddOrReplaceReadGroups to add read group information to my BAM file.
d) Use SortSam to sort my BAM file by coordinate.
e) Use MarkDuplicates to mark the duplicate reads.
Then I got two errors, the ones quoted in the title.

I tried to find a solution in the forums, but I failed.
I think the problem comes from my first step, but I'm sorry to say I don't know how to fix it.
I believe you guys can tell me how to fix it, thanks a lot :)
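
One way to test the suspicion about the first step is to look for read names that occur more than once within the combined file. This is only a sketch: it assumes that repeated read names are related to the PairInfoMap message, which nothing in this thread confirms, and the function name `duplicate_read_names` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list read names that occur more than once in one FASTQ file.
# Assumes standard 4-line FASTQ records; the read name is taken to be the
# first whitespace-delimited token of each header line.
# duplicate_read_names is a hypothetical helper, not part of any tool.
duplicate_read_names() {
    local fastq="$1"
    # gzip -cdf decompresses .gz input and passes plain text through unchanged
    gzip -cdf "$fastq" \
        | awk 'NR % 4 == 1 { print $1 }' \
        | sort \
        | uniq -d
}

# Usage (file name taken from the post above):
#   duplicate_read_names R1.fastq.gz | head
```

If this prints nothing, the combined file has no repeated read names and the cause probably lies elsewhere.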

Best Answer


  • EADG Kiel, Member ✭✭✭

    Hi @Zea1nfO ,

    You could try the following: run your workflow separately for every sample (A, B, C) and then combine your samples with MarkDuplicates. You can pass multiple SAM/BAM files to MarkDuplicates and get a single SAM/BAM file out. Using ASSUME_SORT_ORDER=coordinate might help too.
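
The suggestion above could be sketched roughly as follows. This is not an official pipeline: the HISAT2 index prefix (`reference_index`), the read group values, and all intermediate and output file names are placeholders chosen for illustration, and the `run` and `process_sample` helpers are hypothetical. By default the script only prints the commands (`DRY_RUN=1`); set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: process each sample separately, then merge at MarkDuplicates.
# Assumed/hypothetical: INDEX, read group values, and all output file names.
DRY_RUN="${DRY_RUN:-1}"
INDEX="${INDEX:-reference_index}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

process_sample() {
    local s="$1"
    # a) align this sample's read pairs
    run hisat2 -x "$INDEX" -1 "$s.r1.fastq.gz" -2 "$s.r2.fastq.gz" -S "$s.sam"
    # b) convert SAM to BAM
    run samtools view -b -o "$s.bam" "$s.sam"
    # c) give each sample its own read group (values are placeholders)
    run gatk AddOrReplaceReadGroups -I "$s.bam" -O "$s.rg.bam" \
        --RGID "$s" --RGSM "$s" --RGLB "lib_$s" --RGPL ILLUMINA --RGPU "unit_$s"
    # d) sort by coordinate
    run gatk SortSam -I "$s.rg.bam" -O "$s.sorted.bam" --SORT_ORDER coordinate
}

for sample in sampleA sampleB sampleC; do
    process_sample "$sample"
done

# e) mark duplicates across all samples, writing a single BAM
run gatk MarkDuplicates \
    -I sampleA.sorted.bam -I sampleB.sorted.bam -I sampleC.sorted.bam \
    -O Marked.bam -M marked.metrics --ASSUME_SORT_ORDER coordinate
```

Giving each sample its own read group before combining keeps the samples distinguishable in the merged BAM.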


  • Zea1nfO Member
    edited November 2018

    Hi @EADG,
    thanks a lot, I will try it :)

  • Zea1nfO Member

    Can somebody else answer my questions here, or at least explain the meaning of the first error?
    I am really confused about that.
    Thanks a lot :)

  • bhanuGandham Cambridge MA, Member, Administrator, Broadie, Moderator admin

    Hi @Zea1nfO

    I think the suggestion from EADG might be helpful. Have you tried it? Please send us the entire error log so we can figure out what the issue might be here.


  • Zea1nfO Member

    I have tried it.
    When I run MarkDuplicates with "ASSUME_SORT_ORDER=coordinate" on a BAM file derived from the combined FASTQ files, it still fails and stops processing the data.
    When I pass three BAM files to MarkDuplicates with this command:
    gatk MarkDuplicates -R reference.fa -I A.bam -I B.bam -I C.bam -O Marked.bam -M marked.metrics --ASSUME_SORT_ORDER=coordinate
    GATK first gives me a warning:
    but it can continue to process the data.
    When I run MarkDuplicates on my BAM files separately like this:
    gatk MarkDuplicates -R reference.fa -I A.bam -O A.Marked.bam -M A.marked.metrics --ASSUME_SORT_ORDER=coordinate
    gatk MarkDuplicates -R reference.fa -I B.bam -O B.Marked.bam -M B.marked.metrics --ASSUME_SORT_ORDER=coordinate
    gatk MarkDuplicates -R reference.fa -I C.bam -O C.Marked.bam -M C.marked.metrics --ASSUME_SORT_ORDER=coordinate

    GATK still gives me the same warning as above, but it can continue to process the data as well.
    So I really want to know what the warning I mentioned above means.
    Thanks a lot.
    The entire error log is too long, so I can't take one huge screenshot, but I can draw a sketch for you:

    Please just ignore the date on the picture :)

  • Zea1nfO Member

    Thank you very much.
    I will just ignore it.
