
Error in BQSR: more than one read group with same ID

cdrury, United States, Member


Using samtools merge, I have merged single-sample sorted .bam files into a single .bam file per read group, keeping the associated @RG lines for every sample. When I run BQSR on the merged file, I get an error that there is more than one read group with the same ID. I'm confused, because sharing one read group ID is the point, and I'm not sure where the tool is pulling the information that causes the problem. I've found a previous thread where someone ran this with the same read group info and didn't seem to have a problem: http://gatkforums.broadinstitute.org/gatk/discussion/5986/bqsr-readgroups#latest

I'm running BQSR on the entire read group at once so that I can use 1B bases as input, since some of my samples are small enough that they wouldn't have enough reads on their own. I've run a test on individual samples (with unique RG values appended to the real value so there are no duplicates) and it completes fine.

I suspect this may be an issue with the file merge, but I'm not sure. During BQSR, is the sample ID used for anything? For example, if I run all the data from a single lane without including sample information (i.e. include only the first header), could I then use that recalibration report on the individual .bam files? Or does PrintReads also look for the individual samples when rewriting?

Thank you!

Best Answer


  • cdrury, United States, Member

    I am now seeing in this post: http://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups that read groups should be unique per lane per sample, indicating to me that the appended RG information in the merged file is appropriate. Please advise...

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin
    Can you please post all the read groups that you are trying to process together?

    FYI, we run BQSR per read group; no comparison is done across read groups, only modeling within each read group. This is because what happens to one read group should not be allowed to influence another. But you should be able to provide multiple read groups if they are identified appropriately.
  • cdrury, United States, Member


    Thanks for the info, that is good to know. Does this mean that each sample requires a unique RG ID even if the samples came from the same lane? I'm sorry to keep asking, but I feel I've seen it described both ways. For example, I have RG headers organized in both of the following ways (subset):

    samtools view -h H0FD3ADXX1.merge.bam |grep '@RG'
    @RG ID:H0FD3ADXX.1 LB:AC-A-1.s.bam.RGZ.sam.LB SM:AC-A-1 PL:illumina
    @RG ID:H0FD3ADXX.1-7CB006C7 LB:AC-B-2.s.bam.RGZ.sam.LB SM:AC-B-2 PL:illumina
    @RG ID:H0FD3ADXX.1-147945D8 LB:AC-C-3.s.bam.RGZ.sam.LB SM:AC-C-3 PL:illumina
    @RG ID:H0FD3ADXX.1-21E1B7F7 LB:AC-D-10.s.bam.RGZ.sam.LB SM:AC-D-10 PL:illumina
    @RG ID:H0FD3ADXX.1-34C09EBF LB:AC-E-4.s.bam.RGZ.sam.LB SM:AC-E-4 PL:illumina

    Alphanumeric keys were added to the RG IDs in this file when I merged the per-sample .bam files into one merged .bam file, so each RG in the file is unique, but they all came from the same run/lane of the machine.

    If I remove the key to produce a single read group as follows:

    samtools view -h H0FD3ADXX1.merge.replace.bam |grep '@RG'
    @RG ID:H0FD3ADXX.1 LB:AC-A-1.s.bam.RGZ.sam.LB SM:AC-A-1 PL:illumina
    @RG ID:H0FD3ADXX.1 LB:AC-B-2.s.bam.RGZ.sam.LB SM:AC-B-2 PL:illumina
    @RG ID:H0FD3ADXX.1 LB:AC-C-3.s.bam.RGZ.sam.LB SM:AC-C-3 PL:illumina
    @RG ID:H0FD3ADXX.1 LB:AC-D-10.s.bam.RGZ.sam.LB SM:AC-D-10 PL:illumina
    @RG ID:H0FD3ADXX.1 LB:AC-E-4.s.bam.RGZ.sam.LB SM:AC-E-4 PL:illumina

    then when I run BQSR on this second file, it reports:

    ERROR MESSAGE: Input file: SAMFileHeader{VN=1.4, SO=coordinate} contains more than one RG with the same id (H0FD3ADXX.1)

    So, I have a few options.

    1. I can run the first file, which treats every sample as its own RG but will definitely suffer from too few bases per RG.
    2. I can run the second file keeping only one RG, so that all the reads go under one artificial sample name. I don't know how this will affect PrintReads, i.e. whether it looks for specific samples or just applies the covariate data to every base it encounters.
    3. I've done something else wrong here that I'm not seeing and can fix.

    Thanks for looking at this!
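
The error quoted above is raised when several @RG header lines share one ID. As a rough illustration only (not how GATK itself performs the check), the Python sketch below scans header text for repeated @RG IDs before a file is handed to BQSR. The function names `read_group_ids` and `duplicate_read_group_ids` are hypothetical, and the two header strings are copied from the listings in this thread. Real SAM headers are tab-delimited; the fallback to splitting on whitespace is an assumption made so that the space-separated text shown on this page also parses.

```python
from collections import Counter

# Header lines as shown in this thread (space-separated as displayed on the page).
MERGED_HEADER = """\
@RG ID:H0FD3ADXX.1 LB:AC-A-1.s.bam.RGZ.sam.LB SM:AC-A-1 PL:illumina
@RG ID:H0FD3ADXX.1-7CB006C7 LB:AC-B-2.s.bam.RGZ.sam.LB SM:AC-B-2 PL:illumina
@RG ID:H0FD3ADXX.1-147945D8 LB:AC-C-3.s.bam.RGZ.sam.LB SM:AC-C-3 PL:illumina
@RG ID:H0FD3ADXX.1-21E1B7F7 LB:AC-D-10.s.bam.RGZ.sam.LB SM:AC-D-10 PL:illumina
@RG ID:H0FD3ADXX.1-34C09EBF LB:AC-E-4.s.bam.RGZ.sam.LB SM:AC-E-4 PL:illumina
"""

REPLACED_HEADER = """\
@RG ID:H0FD3ADXX.1 LB:AC-A-1.s.bam.RGZ.sam.LB SM:AC-A-1 PL:illumina
@RG ID:H0FD3ADXX.1 LB:AC-B-2.s.bam.RGZ.sam.LB SM:AC-B-2 PL:illumina
@RG ID:H0FD3ADXX.1 LB:AC-C-3.s.bam.RGZ.sam.LB SM:AC-C-3 PL:illumina
@RG ID:H0FD3ADXX.1 LB:AC-D-10.s.bam.RGZ.sam.LB SM:AC-D-10 PL:illumina
@RG ID:H0FD3ADXX.1 LB:AC-E-4.s.bam.RGZ.sam.LB SM:AC-E-4 PL:illumina
"""


def read_group_ids(header_text):
    """Return the ID value of every @RG line, in order of appearance."""
    ids = []
    for line in header_text.splitlines():
        # Real SAM headers use tabs; fall back to whitespace for pasted text
        # (assumption: no field value contains spaces in that case).
        fields = line.split("\t") if "\t" in line else line.split()
        if not fields or fields[0] != "@RG":
            continue
        for field in fields[1:]:
            if field.startswith("ID:"):
                ids.append(field[len("ID:"):])
                break
    return ids


def duplicate_read_group_ids(header_text):
    """Map each @RG ID that occurs more than once to its number of occurrences."""
    counts = Counter(read_group_ids(header_text))
    return {rg_id: n for rg_id, n in counts.items() if n > 1}


if __name__ == "__main__":
    print("merged:", duplicate_read_group_ids(MERGED_HEADER))
    print("replaced:", duplicate_read_group_ids(REPLACED_HEADER))
```

Run on the two listings, the first header has no repeated IDs, while the second repeats H0FD3ADXX.1 on all five lines, which matches the ID named in the error message.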
