Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Collected FAQs about BAM files

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,822Administrator, GATK Developer admin
edited March 2013 in FAQs

1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard validation.

See the BAM specification for more information.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x):
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x):
chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:1 LN:247249719
@SQ SN:2 LN:242951149
@SQ SN:3 LN:199501827
@SQ SN:4 LN:191273063
@SQ SN:5 LN:180857866
@SQ SN:6 LN:170899992
@SQ SN:7 LN:158821424
@SQ SN:8 LN:146274826
@SQ SN:9 LN:140273252
@SQ SN:10 LN:135374737
@SQ SN:11 LN:134452384
@SQ SN:12 LN:132349534
@SQ SN:13 LN:114142980
@SQ SN:14 LN:106368585
@SQ SN:15 LN:100338915
@SQ SN:16 LN:88827254
@SQ SN:17 LN:78774742
@SQ SN:18 LN:76117153
@SQ SN:19 LN:63811651
@SQ SN:20 LN:62435964
@SQ SN:21 LN:46944323
@SQ SN:22 LN:49691432
@SQ SN:X LN:154913754
@SQ SN:Y LN:57772954
@SQ SN:MT LN:16571
@SQ SN:NT_113887 LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm
@RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm
@RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
@RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
@RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
...

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85
EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

Tag Importance SAM spec definition Meaning ID Required Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read groupIDs may be modified when merging SAM files in order to handle collisions. Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model. SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper. PL Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data . It's a good idea to use this field. LB DNA preparation library identify Essential for MarkDuplicates MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

We do not require value for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400

Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400

Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

9. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.

10. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.

11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • bruce01bruce01 Posts: 16Member
    edited September 2012

    In the example of LB and PI (section 9) should the PI not be PI:200, PI:400, PI:200, PI:400 vs 200,200,400,400? LB flags the library, and you have the 2 libraries sequenced twice, once at 200bp inserts, once at 400bp. Surely this example defines it as being sequenced twice at 200 and 400bp respectively?

    Post edited by bruce01 on
  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    We do not require value for the CN, DS, DT, PG, PI, or PU fields

  • vsvintivsvinti Posts: 47Member

    Hi there,

    Does base recalibration look at the RG ID field to determine sequencing lanes, or does it look at the fastq headers for each read? How important is the format of the ID field for GATK (as long as it's unique per bam file)? It it doesn't contain all of the fastq header information (flowcell, lane etc), does that affect the performance of the gatk algorithms?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,822Administrator, GATK Developer admin

    It is important to put as much information as possible in the RG field, because that is indeed what the tool uses to determine covariates for the analysis. The more information there is in the RG, the more accurate the analysis, for base recalibration at least. Other tools are not so reliant on that information, however.

    Geraldine Van der Auwera, PhD

  • trickytanktrickytank The Walter and Eliza Hall Institute of Medical Research. Melbourne, AustraliaPosts: 3Member

    In regard to the section: "What's the meaning of the standard read group fields?"
    We often have multiplexed samples with the same flowcell and lane number. Should different samples (with different SM values) have the same the same read group IDs (in different BAM files), or should we include the barcode in that too? We wonder if GATK quality score recalibration might somehow use the information that the samples were sequenced in the same flowcell and lane. Is there anything wrong with setting multiple multiplexed samples with the same library?
    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,822Administrator, GATK Developer admin

    Hi @trickytank,

    I'm not sure I understand exactly what is your question, but I can tell you that Base Recalibration should be done purely per lane. The tool is read group aware so you can have information originating from several lanes for each sample in a single bam file; each lane's worth will be recalibrated separately as long as they are distinguished by different read groups.

    Does that clarify things?

    Geraldine Van der Auwera, PhD

  • mstagliamontemstagliamonte USAPosts: 12Member
    edited December 2013

    Hi, Geraldine,
    Is it correct to say that different samples, sequenced together on the same lane, will have exactly the same read group (RG), and be told apart from the sample name (SM)?

    ...I promised in another topic that was my last question of the year... I guess I am not good at keeping promises...

    Max

    Post edited by mstagliamonte on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,822Administrator, GATK Developer admin
    edited March 17

    Tssk tssk... That's okay, Max, you can ask questions until tomorrow at 4 pm EST. After that, pfft, we're gone :)

    The read group string overall contains ID, lane info, sample info etc, so strictly speaking they're not the same. But yes, it's okay for different samples to have the same read group ID. BQSR will correctly process data by ID, and the rest of the GATK tools will process data by sample name (SM).

    EDIT: this was a bad answer (sorry!) and has been corrected below.

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • furgason5furgason5 Posts: 10Member

    This is a very useful thread. However, the information about how to correct the contigs could use some clarifying. I went through the process of figuring out how to fix my bam files that are in the consortium standard to get them to work with the UCSC hg19 convention on my own a year or so ago. But now I can't remember how I got it. SortSam and ReorderSam do not seem to be doing it for me.

  • pdexheimerpdexheimer Posts: 323Member, GSA Collaborator ✭✭✭

    @furgason5, that's because the advice is realign them, not to correct the names. There are slight differences between the assemblies. To my knowledge, there are none in the autosomes or sex chromosomes, but I know the mito reference differs and would expect the unplaced chroms to differ as well.

    Actually, the correct answer that avoids realignment is to use liftover. I know that there's a LiftoverVariants walker that works on VCFs, I don't know of anything for bams.

    After all the warnings, if you still want to just modify the names in the bam, it's reasonably difficult. I don't think there are any tools that will do this for you, but what you would do is to write it out as a sam, filter each read to update the contig names (in both fields - one for the current alignment, one for the mate), update the @SQ dictionary in the header, and then convert back to bam. And if you care about any of the optional read tags, you'll need to update those as well (I think BWA puts alternate alignments in one of those tags occasionally).

    On balance, I usually find it easy to just realign

  • jfarrelljfarrell Posts: 33Member

    If a RG id is unique within a bam file but not unique across bam files, it sounds like the RG id will need to be updated to a unique id across all bam files to be analyzed. It that the case? We have bam files from a sequencing center that has simply sequentially assigned RG ID to 0, 0.1 or 0.2 within each bam file. What is the best approach to deal with this? Is there a way to automatically generate unique RG ids across these bam files or is an update of the RG id required in each bam file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,822Administrator, GATK Developer admin

    Hi @jfarrell,

    Different tools use different RG tags to aggregate data, so for those that don't use the ID (but instead use SM, for example) you may get away with non-unique RG IDs. HOWEVER this is generally a bad idea and may cause you terrible headaches down the road, so I would strongly suggest using unique IDs. This should be a data management policy decision. You can update the RGs as you get the data from the sequencing center (or better yet, yell at them and demand that they assign sensible RGs); see the recommendations for generating unique IDs given in the table above.

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Hi @Geraldine_VdAuwera‌ ,

    In your answers above to @trickytank‌ and @mstagliamonte‌ , you mentioned in the case of multiplexing several samples in one lane, all read groups should have the same @RG ID. I thought this wasn't allowed. Let me show an example.

    I have two samples and I'm sequencing them in the same lane(flowcell=A, lane=1). If I understood your answers, you are saying my read groups should look like:

    @RG ID:A.1 SM:SAMPLE01 LB:SAMPLE01
    @RG ID:A.1 SM:SAMPLE02 LB:SAMPLE02

    I don't think you can have multiple read groups with the same @RG ID. If I understood the SAM spec, I thought I should have something like this:

    @RG ID:SAMPLE01 SM:SAMPLE01 LB:SAMPLE01 PU:A.1
    @RG ID:SAMPLE02 SM:SAMPLE02 LB:SAMPLE02 PU:A.1

    Now I have @RG ID that are unique, but BSQR won't use the same error model for both read groups. I wonder why BSQR doesn't use @RG PU instead? Picard already requires that tag.

    Thanks,
    Carlos

  • ebanksebanks Posts: 678GATK Developer mod

    Correct. It is NEVER okay to have identical read group IDs. That was a mistake on our end - thanks for catching it!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • KurtKurt Posts: 126Member ✭✭✭

    My understanding/experience has been this. PU should always be the most unique identifier. Given that PU is flowcell,lane.barcode for multiplex samples then if using SamToFastq.jar in Picard and specifying OUTPUT_PER_RG, the PU tag takes precedence over the ID tag when outputting the prefix name(s) of the fastq file(s). So if you have multiplex samples in a lane, but have a bam file for each platform unit, then individually they can have the same ID tag. However, if you are to merge them all back into a single bam file then the ID tag needs to be modified to avoid collisions, I believe MergeSamFiles.jar in Picard does this for you (or maybe a walker in GATK or both, I haven't had a need to do that in years). So given that you have two separate bam files like so;

    @RG ID:flowcellA.lane1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX

    and

    @RG ID:flowcellA.lane1 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodeY

    and then merged them together then, the RG lines should be;

    @RG ID:flowcellA.lane1.1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX

    @RG ID:flowcellA.lane1.2 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodyY

  • mfletchermfletcher DEPosts: 23Member

    Hello,

    I have a question about read groups: I've managed to give all my data (n=24 bams) the same RGID so obviously that needs to be fixed before I can trust anything.

    However, given that I was running BQSR on each individual bam file in isolation, does the RGID matter for this step? Will re-running BQSR on each individual bam with the corrected RGID make any difference (or could I save the compute time and replace the read groups on my previously BQSR-recalibrated bams)?

    (And on further thought: Of my 24 samples, 10 were multiplexed into one lane. So I presume these should have the same RGID. But do I still run BQSR on each of these 10 multiplexed samples individually? Or on all 10 at once (is that even possible?!))

    Thanks!

  • SheilaSheila Broad InstitutePosts: 269Member, GATK Developer, Broadie, Moderator admin

    @mfletcher

    Hi,

    For the files where 1 bam file contains the data from 1 sample run on 1 lane, you can just replace the rgid and proceed with your analysis.

    For the 10 multiplexed samples, it is recommended to run BaseRecalibrator on them together. (Yes, it is possible!) You can either input a list of the bam files with each file name on a separate line and name it with a .list at the end. You can also use -I 10 times, with each file after 1 -I.

    Recalibrating all the data from one lane together is better because it empowers the modeling process, but it may not make a big difference if the data has deep coverage and good quality sequence. It is up to you to decide whether to rerun or not, depending on how confident you are in the quality of the data.

    If you need individual files for each of the 10 samples, you can use PrintReads after BQSR. Please read about Print Reads here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_readutils_PrintReads.html

    I hope this helps.

    -Sheila

  • mfletchermfletcher DEPosts: 23Member

    Dear @Sheila‌, thanks very much for that info! I'll do as you suggest (the multiplexed samples are low coverage so I will rerun the BQSR on them collectively).
    Cheers!

  • SheilaSheila Broad InstitutePosts: 269Member, GATK Developer, Broadie, Moderator admin

    @mfletcher

    Hi,

    Please also note that typical usage is to use PrintReads after BQSR in any case, whether you want one group file or you want individual files. The difference is that in the former case, you pass in all the files as input to PrintReads in one go, while if you want separate files, you run PrintReads separately on each file in turn.

    I do not think this is stated in the PrintReads document.

    -Sheila

  • mfletchermfletcher DEPosts: 23Member

    Hi @Sheila‌, yes, the PrintReads step was already in my pipeline - the document is not particularly clear about multi and single sample input vs output, but it seems to be working fine for me. Thanks for the clarification!

  • SheilaSheila Broad InstitutePosts: 269Member, GATK Developer, Broadie, Moderator admin

    @mfletcher

    Hi,

    I am happy it is working for you. Thanks for pointing out that the document is not particularly clear. We will try to clear it up soon.

    -Sheila

  • jalvesjalves Posts: 1Member

    Hi,

    This is indeed a very useful thread. However, I am still a bit confused regarding what is the best way to proceed in terms of the read group ID assignment for different samples sequenced in the same lane.

    I have 6 lanes of data. In each lane, 12 samples were multiplexed and sequenced together. I have therefore 12 different bam files per lane (72 in total).

    Initially, I was assigning the same @RG ID for the samples sequenced in the same lane, but after reading this thread I understand that the @RG ID must be unique and therefore all samples should have different @RG ID. Is this correct?

    However, if all samples have different @RG ID now, how will BSQR now which ones came from the same lane to use the same error model? (this goes on the following of @Carlos question).

    Many thanks for the help!

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Hi @jalves‌,

    As far as I know BSQR doesn't have a way to tell which samples came from the same lane in a multiplexing run. We have commercial support for GATK from Appistry and that was their answer to this question. I also asked here, at biostars and directly to Appistry about why GATK doesn't use PU to identify samples from the same lane. No answer so far as why that isn't the case.

    --Carlos

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Quick note, looking through my communications with Appistry I did get an answer about why not use PU. At the moment this tag is not required. I still think GATK should use the information from the tag if if present.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,822Administrator, GATK Developer admin

    Hi Carlos et al.,

    To be honest, this is a very low priority point for us because our in-house sequencing platform has an established system for processing and managing samples identifiers that works with how BQSR is set up. So even though PU could potentially be useful/relevant (I have not thought this all the way through myself), making the necessary changes to use it if present could lead to complications that we don't have the resources to tackle at this time.

    We're also hesitant to advise people on how to proceed exactly with read group assignment because it depends so much on how the sequencing experiment is configured (are samples multiplexed on same lanes and/or spread across lanes, how do you want to split up / merge data etc, what technologies are involved). So we prefer to communicate clearly (or so we hope) on what basis BQSR treats a set of data as a unit (the RG ID), and the idea that recalibration should be done per unit of data that is subject to the same machine biases. With that, people should be able to decide how to process their data appropriately.

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Hi @Geraldine_VdAuwera‌ ,

    That's fair. My follow up question would be if is possible for you to share how Broad manage samples identifiers in a same lane multiplexed experiment. I don't see a way to do it and at the same time make sure recalibration is done per unit of data subjected to the same machine biases.

    Thanks, Carlos.

  • drashu_11drashu_11 Posts: 11Member

    Hi

    I have 4 recalibrated bam files produced from realined bam files (Note:I have lost my realined.bam files)

    Right now I need to split those by chromosomes

    I have tried the following code

    java -Xmx4g -jar PATH/GenomeAnalysisTK.jar -T PrintReads -R PATH/reference -I rcal.bam01 -I rcal.bam02 -I rcal.bam03 -I rcal.bam04 -L chr25 -o PATH/chr25.bam -nct 8 -S LENIENT

    It seems like working for a long time but no file PATH/chr25.bam is generating in the output directory

    Any help?

    Thanks in advance

    Ashutosh

  • SheilaSheila Broad InstitutePosts: 269Member, GATK Developer, Broadie, Moderator admin

    @drashu_11

    Hi Ashutosh,

    Why are you using -S LENIENT? This could be covering up something that is wrong with your file. Try running without it, and see if that works.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,822Administrator, GATK Developer admin

    @CarlosBorroto

    Sorry for the delayed response, I've been on the road for a workshop.

    I've checked what we do internally; actually the Broad production pipeline demultiplexes samples from each lane to individual bam files, so they are recalibrated individually. So this simplifies the processing since each sample per lane gets its own read group. The downside is that you have to be sure to have enough data per sample for recalibration to work properly, which limits the number of samples you can multiplex on a single lane.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.