Picard AddOrReplaceReadGroups for a single biological sample

I would like to use GATK tools for comparative genomics, specifically DepthOfCoverage and HaplotypeCaller. I am confused about how to edit read groups for BAM files with Picard. Forum threads all seem to address large population studies or mixed samples, except for one statement I found: when a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group.

I have 3 different yeast strains (1 parent and 2 evolved isolates), each sequenced individually on PacBio (2 SMRT cells per strain) and on Illumina (300 bp paired-end), which resulted in 3 FASTQ files per strain (RSII PacBio data plus R1 and R2 for Illumina). I have two BAM files per strain, one with the PacBio aligned reads and one with the Illumina R1+R2 aligned reads. Since each strain had only one library per platform, does it make sense to use:
RGPL=illumina for illumina_bam and RGPL=pacbio for pacbio_bam
... but I do not know what to put for RGID, RGLB, or RGPU.

Best Answers


  • bshifaw Member, Broadie, Moderator admin

    Hi @othomps101

    A similar question about read groups may have been discussed in the following document: read-groups. Try reading through that document and reviewing LuisaB's question and answer in the following thread. Let me know whether this was helpful.

  • othomps101 Member
    Hi @bshifaw
    It was somewhat helpful. I am still confused about the non-multiplexed, single-sample case.

    Would it be fair to use the following, given the head and tail read names for each FASTQ file:

    @m(date_time)_instrument#_SMRTcellbarcode_set#_part#_ZMW#_(subreadregion start_stop)
    ==> StrainA_RSII_filtered_subreads.fastq <==
    ==> StrainB_RSII_filtered_subreads.fastq <==
    ==> StrainC_RSII_filtered_subreads.fastq <==

    RGPL= pacbio

    @instrument:run#:flowcellID:lane:tile:xpos:ypos read1or2:filtered?:control#:sample#
    ==> StrainA_S1_L001_R1_001.fastq <==
    @M02780:209:000000000-B69JD:1:1101:10019:1252 1:N:0:NGACCA
    @M02780:209:000000000-B69JD:1:1101:10019:1252 2:N:0:NGACCA
    @M02780:209:000000000-B69JD:1:2119:15110:25287 1:N:0:TGACCA
    @M02780:209:000000000-B69JD:1:2119:15110:25287 2:N:0:TGACCA
    ==> StrainB_S2_L001_R1_001.fastq <==
    @M02780:209:000000000-B69JD:1:1101:9381:1251 1:N:0:NCCAAT
    @M02780:209:000000000-B69JD:1:1101:9381:1251 2:N:0:NCCAAT
    @M02780:209:000000000-B69JD:1:2119:11733:25286 1:N:0:GCCAAT
    @M02780:209:000000000-B69JD:1:2119:11733:25286 2:N:0:GCCAAT
    ==> StrainC_S3_L001_R1_001.fastq <==
    @M02780:209:000000000-B69JD:1:1101:8914:1253 1:N:0:NTTGTA
    @M02780:209:000000000-B69JD:1:1101:8914:1253 2:N:0:NTTGTA
    @M02780:209:000000000-B69JD:1:2119:21190:25287 1:N:0:CTTGTA
    @M02780:209:000000000-B69JD:1:2119:21190:25287 2:N:0:CTTGTA

    RGPL= illumina
    RGPU=flowcellid.lane.strain (even though flowcellid.lane is the same for all three strains)
  • othomps101 Member
    Thanks for the quick response @bshifaw
    The flowcellid.lane is the same for all three strains. Wouldn't that create problems if the RGID is the same for all samples?
  • bshifaw Member, Broadie, Moderator admin
    Accepted Answer

    Good point.
    The same question was asked here.

    In this case, the SM tag will distinguish between the different samples.
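
To make the discussion above concrete, here is a minimal Python sketch that derives read group fields from one of the Illumina read names posted in this thread and assembles a Picard AddOrReplaceReadGroups command line. The helper names (`illumina_read_group`, `picard_command`), the BAM file names, the `picard.jar` path, and the `<strain>_lib1` library label are all hypothetical. Including the strain in RGID is a conservative assumption of this sketch (it keeps IDs unique if the per-strain BAMs are ever merged), not something the thread prescribes. The PacBio BAMs would get RGPL=PACBIO with the other fields chosen by hand, since no PacBio read names are shown here.

```python
import re
import shlex

# Illumina read-name layout as described in the thread:
# @instrument:run#:flowcellID:lane:tile:xpos:ypos read1or2:filtered?:control#:sample#
ILLUMINA_HEADER = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>[^:\s]+):(?P<flowcell>[^:\s]+):(?P<lane>\d+)"
    r":\d+:\d+:\d+\s+[12]:[YN]:\d+:\S+$"
)


def illumina_read_group(header, sample, library=None):
    """Derive read group fields from one Illumina read name (hypothetical helper)."""
    match = ILLUMINA_HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized Illumina read name: {header!r}")
    unit = f"{match['flowcell']}.{match['lane']}.{sample}"
    return {
        "RGID": unit,  # assumption: unique per strain in case the BAMs are merged
        "RGLB": library or f"{sample}_lib1",  # assumption: one library per strain
        "RGPL": "ILLUMINA",
        "RGPU": unit,  # flowcellid.lane.strain, as proposed in the thread
        "RGSM": sample,  # the SM tag is what tells the strains apart
    }


def picard_command(in_bam, out_bam, fields, picard_jar="picard.jar"):
    """Build a shell-safe AddOrReplaceReadGroups command using KEY=value arguments."""
    args = ["java", "-jar", picard_jar, "AddOrReplaceReadGroups",
            f"I={in_bam}", f"O={out_bam}"]
    args += [f"{key}={value}" for key, value in sorted(fields.items())]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # First read name from StrainA_S1_L001_R1_001.fastq as posted above.
    header = "@M02780:209:000000000-B69JD:1:1101:10019:1252 1:N:0:NGACCA"
    fields = illumina_read_group(header, "StrainA")
    # File names below are placeholders, not paths from the thread.
    print(picard_command("StrainA_illumina.bam", "StrainA_illumina.rg.bam", fields))
```

Running the same helper once per strain gives each BAM its own RGSM value, which is the tag that distinguishes the samples even though the flowcell and lane are shared.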

