Read pair records have different read groups ERROR

jfarrell Member ✭✭

I am running GenomeSTRiP on 66 deeply sequenced BAM files. Of the 3147 runs during the discovery step, 11 failed with an error like this:

  SVDiscovery-113.out:java.lang.IllegalArgumentException: Read pair records have different read groups: 3175315: H12TV.1,H0YEP.2
  SVDiscovery-113.out:##### ERROR MESSAGE: Read pair records have different read groups:             3175315: H12TV.1,H0YEP.2

Searching for reads with that QNAME and those read groups, I found the following two pairs.

            samtools view SRR958531.bam|grep -w  ^3175315
            3175315 163     1       16592899        37        74M     =       16593403         605     ......     RG:Z:H0YEP.2    NH:i:1  NM:i:0
            3175315 83      1        16593403        37      101M     =       16592899        -605     ......     RG:Z:H0YEP.2     NH:i:1  NM:i:0

            samtools view ../delly/SRR960789.bam|grep -w  ^3175315
            3175315 99       1       16592656        37      101M    =       16593199         644     .......     RG:Z:H12TV.1    NH:i:1  NM:i:0
            3175315 147     1       16593199        37      101M    =       16592656        -644     .......     RG:Z:H12TV.1    NH:i:1  NM:i:0

So the two read pairs involved in the error come from two different BAM files, from two different individuals, and from two different read groups. The QNAME is the same in both pairs, though, which I believe is what triggered the error in GenomeSTRiP: the software paired a read from one read group in one BAM file with a read from a different pair, in a different read group, in another BAM file. The software appears to assume that the QNAME is unique across all paired-end reads in all BAM files. For this set of BAM files, at least, that is not the case.
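To see how widespread the collisions are, something like the following rough sketch could be run over the text output of `samtools view` from several BAM files. It only uses the standard SAM column layout (QNAME in column 1, optional tags from column 12 on). The function names are made up for illustration, and the SEQ and QUAL columns in the sample records are replaced with `*` because they were elided above.

```python
from collections import defaultdict

def read_group(fields):
    """Return the value of the RG:Z: tag among the optional SAM fields, or None."""
    for tag in fields[11:]:
        if tag.startswith("RG:Z:"):
            return tag[5:]
    return None

def qnames_shared_across_read_groups(sam_lines):
    """Map each QNAME seen in more than one read group to its sorted read groups."""
    groups_by_qname = defaultdict(set)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        groups_by_qname[fields[0]].add(read_group(fields))
    return {
        qname: sorted(groups, key=str)
        for qname, groups in groups_by_qname.items()
        if len(groups) > 1
    }

# The four records from the post; SEQ and QUAL are '*' placeholders.
records = [
    "3175315\t163\t1\t16592899\t37\t74M\t=\t16593403\t605\t*\t*\tRG:Z:H0YEP.2\tNH:i:1\tNM:i:0",
    "3175315\t83\t1\t16593403\t37\t101M\t=\t16592899\t-605\t*\t*\tRG:Z:H0YEP.2\tNH:i:1\tNM:i:0",
    "3175315\t99\t1\t16592656\t37\t101M\t=\t16593199\t644\t*\t*\tRG:Z:H12TV.1\tNH:i:1\tNM:i:0",
    "3175315\t147\t1\t16593199\t37\t101M\t=\t16592656\t-644\t*\t*\tRG:Z:H12TV.1\tNH:i:1\tNM:i:0",
]

print(qnames_shared_across_read_groups(records))
# {'3175315': ['H0YEP.2', 'H12TV.1']}
```

For whole-genome BAM files the dictionary would get large, so in practice this is probably best restricted to one region at a time.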

Could the software look for matching paired ends only within the same RG to avoid this issue? Or match only those pairs whose PNEXT fields correspond? Are there any other solutions or workarounds?
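To make the suggestion concrete, here is a minimal sketch of what I have in mind. It is not GenomeSTRiP's actual code, just an illustration on SAM text lines: mates are grouped by (read group, QNAME) rather than by QNAME alone, and a pair is accepted only if each record's PNEXT equals its mate's POS. The helper names are hypothetical, SEQ and QUAL are `*` placeholders, and secondary or supplementary alignments are ignored for simplicity.

```python
from collections import defaultdict

def parse_record(line):
    """Extract the SAM columns needed for mate matching from one text line."""
    f = line.rstrip("\n").split("\t")
    rg = next((t[5:] for t in f[11:] if t.startswith("RG:Z:")), None)
    return {"qname": f[0], "pos": int(f[3]), "pnext": int(f[7]), "rg": rg}

def mates_agree(a, b):
    """True if each record's PNEXT points at the other record's POS."""
    return a["pnext"] == b["pos"] and b["pnext"] == a["pos"]

def pair_mates(sam_lines):
    """Pair records that share both read group and QNAME and whose PNEXT fields agree."""
    buckets = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        rec = parse_record(line)
        buckets[(rec["rg"], rec["qname"])].append(rec)
    return [
        (recs[0], recs[1])
        for recs in buckets.values()
        if len(recs) == 2 and mates_agree(recs[0], recs[1])
    ]

# The four records from the post, interleaved as they might be when files are merged.
records = [
    "3175315\t99\t1\t16592656\t37\t101M\t=\t16593199\t644\t*\t*\tRG:Z:H12TV.1\tNH:i:1\tNM:i:0",
    "3175315\t163\t1\t16592899\t37\t74M\t=\t16593403\t605\t*\t*\tRG:Z:H0YEP.2\tNH:i:1\tNM:i:0",
    "3175315\t147\t1\t16593199\t37\t101M\t=\t16592656\t-644\t*\t*\tRG:Z:H12TV.1\tNH:i:1\tNM:i:0",
    "3175315\t83\t1\t16593403\t37\t101M\t=\t16592899\t-605\t*\t*\tRG:Z:H0YEP.2\tNH:i:1\tNM:i:0",
]

for first, second in pair_mates(records):
    print(first["rg"], first["pos"], second["pos"])
# H12TV.1 16592656 16593199
# H0YEP.2 16592899 16593403
```

With QNAME as the only key, all four records would land in the same bucket, which is the situation that seems to produce the error. Adding the read group to the key keeps the two pairs apart.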

In general, does the GATK framework assume that the QNAME of a read pair is unique across all BAM files and all RGs? I thought the QNAME naming conventions used by sequencing platforms would produce unique names, so I was surprised to see this lack of uniqueness. I suspect the names were shortened to reduce the storage needed for the QNAME field.


Best Answer


  • jfarrell Member ✭✭

    Thanks Bob for the quick response! The new interim release is working well. Using the new parameter option, the discovery run completed with no errors.

  • jfarrell Member ✭✭

    Will this issue affect the preprocessing step in GenomeSTRiP? Originally that step did not generate any errors, but I was wondering whether it is important to rerun it with the new parameter option.

  • bhandsaker Member, Broadie ✭✭✭✭

    No, I don't think it will affect preprocessing.

    Preprocessing is run mostly one BAM at a time, so as long as the read names are unique within each BAM file (per the spec), everything should be OK.

