Read pair records have different read groups ERROR
I am running GenomeSTRiP on 66 deeply sequenced BAM files. Of the 3147 runs during the discovery step, 11 failed with an error like this:
SVDiscovery-113.out:java.lang.IllegalArgumentException: Read pair records have different read groups: 3175315: H12TV.1,H0YEP.2
SVDiscovery-113.out:##### ERROR MESSAGE: Read pair records have different read groups: 3175315: H12TV.1,H0YEP.2
Searching for read pairs with that QNAME and those read groups, I found the following two pairs.
samtools view SRR958531.bam|grep -w ^3175315
3175315 163 1 16592899 37 74M = 16593403 605 ...... RG:Z:H0YEP.2 NH:i:1 NM:i:0
3175315 83 1 16593403 37 101M = 16592899 -605 ...... RG:Z:H0YEP.2 NH:i:1 NM:i:0

samtools view ../delly/SRR960789.bam|grep -w ^3175315
3175315 99 1 16592656 37 101M = 16593199 644 ....... RG:Z:H12TV.1 NH:i:1 NM:i:0
3175315 147 1 16593199 37 101M = 16592656 -644 ....... RG:Z:H12TV.1 NH:i:1 NM:i:0
So the two read pairs involved in the error come from two different BAM files, two different individuals, and two different read groups. Both pairs share the same QNAME, though, which I believe is what triggered the error in GenomeSTRiP: the software paired a read from one read group in one BAM file with a read from a different pair, in a different read group, in another BAM file. It appears to assume that a QNAME is unique across all paired-end reads in all BAM files. For this set of BAM files, at least, that is not the case.
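To check how widespread such collisions are, something like the following could be run over the combined `samtools view` output. It is only a minimal sketch: the function names `read_group` and `qname_collisions` are made up for illustration, and the example records are modelled on the four reads above with SEQ and QUAL replaced by `*` and the other tags dropped.

```python
from collections import defaultdict


def read_group(fields):
    """Return the value of the RG:Z tag among the optional SAM fields, or None."""
    for tag in fields[11:]:
        if tag.startswith("RG:Z:"):
            return tag[5:]
    return None


def qname_collisions(sam_lines):
    """Map every QNAME seen in more than one read group to its sorted read groups."""
    groups = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Records without an RG tag are collected under a placeholder label.
        groups[fields[0]].add(read_group(fields) or "<no RG>")
    return {q: sorted(rgs) for q, rgs in groups.items() if len(rgs) > 1}


# Illustrative records only (SEQ/QUAL set to "*", other tags omitted).
records = [
    "\t".join(["3175315", "163", "1", "16592899", "37", "74M", "=", "16593403", "605", "*", "*", "RG:Z:H0YEP.2"]),
    "\t".join(["3175315", "83", "1", "16593403", "37", "101M", "=", "16592899", "-605", "*", "*", "RG:Z:H0YEP.2"]),
    "\t".join(["3175315", "99", "1", "16592656", "37", "101M", "=", "16593199", "644", "*", "*", "RG:Z:H12TV.1"]),
    "\t".join(["3175315", "147", "1", "16593199", "37", "101M", "=", "16592656", "-644", "*", "*", "RG:Z:H12TV.1"]),
]

collisions = qname_collisions(records)
print(collisions)
```

In practice the lines would come from piping `samtools view` for each BAM file into the script rather than from a hard-coded list.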
Could the software look for mates only within the same read group to avoid this issue? Or match only those pairs whose PNEXT fields correspond? Are there any other solutions or workarounds?
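To make the suggestion concrete, here is a rough sketch of the pairing rule I have in mind. It is not how GenomeSTRiP actually pairs reads; the `Read` tuple and `pair_mates` function are hypothetical, and the sketch ignores RNAME/RNEXT and secondary or supplementary alignments for brevity. Reads are bucketed by (read group, QNAME) instead of QNAME alone, and two reads are accepted as mates only if each one's PNEXT equals the other's POS.

```python
from collections import defaultdict, namedtuple

# Hypothetical minimal record: just the SAM fields the pairing rule needs.
Read = namedtuple("Read", "qname flag pos pnext rg")


def pair_mates(reads):
    """Pair reads sharing QNAME and read group whose POS/PNEXT point at each other."""
    buckets = defaultdict(list)
    for r in reads:
        buckets[(r.rg, r.qname)].append(r)  # key on read group AND QNAME

    pairs = []
    for members in buckets.values():
        firsts = [r for r in members if r.flag & 0x40]   # first segment in template
        seconds = [r for r in members if r.flag & 0x80]  # last segment in template
        for a in firsts:
            for b in seconds:
                if a.pnext == b.pos and b.pnext == a.pos:
                    pairs.append((a, b))
    return pairs


# The four reads from the samtools output above, reduced to the needed fields.
reads = [
    Read("3175315", 163, 16592899, 16593403, "H0YEP.2"),
    Read("3175315", 83, 16593403, 16592899, "H0YEP.2"),
    Read("3175315", 99, 16592656, 16593199, "H12TV.1"),
    Read("3175315", 147, 16593199, 16592656, "H12TV.1"),
]

for first, second in pair_mates(reads):
    print(first.rg, first.pos, second.pos)
```

With this rule the four reads resolve into two pairs, one per read group, and no pair mixes `H12TV.1` with `H0YEP.2`.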
In general, does the GATK framework assume that the QNAME of a read pair is unique across all BAM files and all read groups? I thought the QNAME naming conventions used by sequencing platforms should produce unique names, so I was surprised to see this lack of uniqueness. I suspect it may be an approach to reduce the storage requirements for the QNAME field.