The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# Collected FAQs about BAM files

edited March 2013 in FAQs

### 1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

### 2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

### 3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

• It must be aligned to one of the references described here.
• It must be sorted in coordinate order (not by queryname and not "unsorted").
• It must list the read groups with sample names in the header.
• The BAM file must pass Picard validation.

### 4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x):
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x):
chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

$samtools view -H /path/to/my.bam @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:247249719 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954 @SQ SN:MT LN:16571 @SQ SN:NT_113887 LN:3994 ...  If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements. ### 6. My BAM file isn't sorted that way. How can I fix it? Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence. ### 7. How can I tell if my BAM file has read group and sample information? A quick Unix command using Samtools will do the trick: $ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...


The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

\$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...


membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

### 8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

### 9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

We do not require value for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400


Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

### 10. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.

### 11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam


and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 16Member
edited September 2012

In the example of LB and PI (section 9) should the PI not be PI:200, PI:400, PI:200, PI:400 vs 200,200,400,400? LB flags the library, and you have the 2 libraries sequenced twice, once at 200bp inserts, once at 400bp. Surely this example defines it as being sequenced twice at 200 and 400bp respectively?

Post edited by bruce01 on

We do not require value for the CN, DS, DT, PG, PI, or PU fields

• Posts: 47Member

Hi there,

Does base recalibration look at the RG ID field to determine sequencing lanes, or does it look at the fastq headers for each read? How important is the format of the ID field for GATK (as long as it's unique per bam file)? It it doesn't contain all of the fastq header information (flowcell, lane etc), does that affect the performance of the gatk algorithms?

It is important to put as much information as possible in the RG field, because that is indeed what the tool uses to determine covariates for the analysis. The more information there is in the RG, the more accurate the analysis, for base recalibration at least. Other tools are not so reliant on that information, however.

Geraldine Van der Auwera, PhD

• The Walter and Eliza Hall Institute of Medical Research. Melbourne, AustraliaPosts: 3Member

In regard to the section: "What's the meaning of the standard read group fields?"
We often have multiplexed samples with the same flowcell and lane number. Should different samples (with different SM values) have the same the same read group IDs (in different BAM files), or should we include the barcode in that too? We wonder if GATK quality score recalibration might somehow use the information that the samples were sequenced in the same flowcell and lane. Is there anything wrong with setting multiple multiplexed samples with the same library?
Thanks

Hi @trickytank,

I'm not sure I understand exactly what is your question, but I can tell you that Base Recalibration should be done purely per lane. The tool is read group aware so you can have information originating from several lanes for each sample in a single bam file; each lane's worth will be recalibrated separately as long as they are distinguished by different read groups.

Does that clarify things?

Geraldine Van der Auwera, PhD

• USAPosts: 12Member
edited December 2013

Hi, Geraldine,
Is it correct to say that different samples, sequenced together on the same lane, will have exactly the same read group (RG), and be told apart from the sample name (SM)?

...I promised in another topic that was my last question of the year... I guess I am not good at keeping promises...

Max

Post edited by mstagliamonte on
edited March 17

Tssk tssk... That's okay, Max, you can ask questions until tomorrow at 4 pm EST. After that, pfft, we're gone :)

The read group string overall contains ID, lane info, sample info etc, so strictly speaking they're not the same. But yes, it's okay for different samples to have the same read group ID. BQSR will correctly process data by ID, and the rest of the GATK tools will process data by sample name (SM).

EDIT: this was a bad answer (sorry!) and has been corrected below.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

• Posts: 10Member

This is a very useful thread. However, the information about how to correct the contigs could use some clarifying. I went through the process of figuring out how to fix my bam files that are in the consortium standard to get them to work with the UCSC hg19 convention on my own a year or so ago. But now I can't remember how I got it. SortSam and ReorderSam do not seem to be doing it for me.

• Posts: 323Member, GSA Collaborator ✭✭✭

@furgason5, that's because the advice is realign them, not to correct the names. There are slight differences between the assemblies. To my knowledge, there are none in the autosomes or sex chromosomes, but I know the mito reference differs and would expect the unplaced chroms to differ as well.

Actually, the correct answer that avoids realignment is to use liftover. I know that there's a LiftoverVariants walker that works on VCFs, I don't know of anything for bams.

After all the warnings, if you still want to just modify the names in the bam, it's reasonably difficult. I don't think there are any tools that will do this for you, but what you would do is to write it out as a sam, filter each read to update the contig names (in both fields - one for the current alignment, one for the mate), update the @SQ dictionary in the header, and then convert back to bam. And if you care about any of the optional read tags, you'll need to update those as well (I think BWA puts alternate alignments in one of those tags occasionally).

On balance, I usually find it easy to just realign

• Posts: 33Member

If a RG id is unique within a bam file but not unique across bam files, it sounds like the RG id will need to be updated to a unique id across all bam files to be analyzed. It that the case? We have bam files from a sequencing center that has simply sequentially assigned RG ID to 0, 0.1 or 0.2 within each bam file. What is the best approach to deal with this? Is there a way to automatically generate unique RG ids across these bam files or is an update of the RG id required in each bam file?

Hi @jfarrell,

Different tools use different RG tags to aggregate data, so for those that don't use the ID (but instead use SM, for example) you may get away with non-unique RG IDs. HOWEVER this is generally a bad idea and may cause you terrible headaches down the road, so I would strongly suggest using unique IDs. This should be a data management policy decision. You can update the RGs as you get the data from the sequencing center (or better yet, yell at them and demand that they assign sensible RGs); see the recommendations for generating unique IDs given in the table above.

Geraldine Van der Auwera, PhD

• Posts: 33Member

Hi @Geraldine_VdAuwera‌ ,

In your answers above to @trickytank‌ and @mstagliamonte‌ , you mentioned in the case of multiplexing several samples in one lane, all read groups should have the same @RG ID. I thought this wasn't allowed. Let me show an example.

I have two samples and I'm sequencing them in the same lane(flowcell=A, lane=1). If I understood your answers, you are saying my read groups should look like:

@RG ID:A.1 SM:SAMPLE01 LB:SAMPLE01 @RG ID:A.1 SM:SAMPLE02 LB:SAMPLE02

I don't think you can have multiple read groups with the same @RG ID. If I understood the SAM spec, I thought I should have something like this:

@RG ID:SAMPLE01 SM:SAMPLE01 LB:SAMPLE01 PU:A.1 @RG ID:SAMPLE02 SM:SAMPLE02 LB:SAMPLE02 PU:A.1

Now I have @RG ID that are unique, but BSQR won't use the same error model for both read groups. I wonder why BSQR doesn't use @RG PU instead? Picard already requires that tag.

Thanks,
Carlos

• Posts: 678GATK Developer mod

Correct. It is NEVER okay to have identical read group IDs. That was a mistake on our end - thanks for catching it!

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 126Member ✭✭✭

My understanding/experience has been this. PU should always be the most unique identifier. Given that PU is flowcell,lane.barcode for multiplex samples then if using SamToFastq.jar in Picard and specifying OUTPUT_PER_RG, the PU tag takes precedence over the ID tag when outputting the prefix name(s) of the fastq file(s). So if you have multiplex samples in a lane, but have a bam file for each platform unit, then individually they can have the same ID tag. However, if you are to merge them all back into a single bam file then the ID tag needs to be modified to avoid collisions, I believe MergeSamFiles.jar in Picard does this for you (or maybe a walker in GATK or both, I haven't had a need to do that in years). So given that you have two separate bam files like so;

@RG ID:flowcellA.lane1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX

and

@RG ID:flowcellA.lane1 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodeY

and then merged them together then, the RG lines should be;

@RG ID:flowcellA.lane1.1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX

@RG ID:flowcellA.lane1.2 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodyY

• DEPosts: 23Member

Hello,

I have a question about read groups: I've managed to give all my data (n=24 bams) the same RGID so obviously that needs to be fixed before I can trust anything.

However, given that I was running BQSR on each individual bam file in isolation, does the RGID matter for this step? Will re-running BQSR on each individual bam with the corrected RGID make any difference (or could I save the compute time and replace the read groups on my previously BQSR-recalibrated bams)?

(And on further thought: Of my 24 samples, 10 were multiplexed into one lane. So I presume these should have the same RGID. But do I still run BQSR on each of these 10 multiplexed samples individually? Or on all 10 at once (is that even possible?!))

Thanks!

Hi,

For the files where 1 bam file contains the data from 1 sample run on 1 lane, you can just replace the rgid and proceed with your analysis.

For the 10 multiplexed samples, it is recommended to run BaseRecalibrator on them together. (Yes, it is possible!) You can either input a list of the bam files with each file name on a separate line and name it with a .list at the end. You can also use -I 10 times, with each file after 1 -I.

Recalibrating all the data from one lane together is better because it empowers the modeling process, but it may not make a big difference if the data has deep coverage and good quality sequence. It is up to you to decide whether to rerun or not, depending on how confident you are in the quality of the data.

I hope this helps.

-Sheila

• DEPosts: 23Member

Dear @Sheila‌, thanks very much for that info! I'll do as you suggest (the multiplexed samples are low coverage so I will rerun the BQSR on them collectively).
Cheers!

Hi,

Please also note that typical usage is to use PrintReads after BQSR in any case, whether you want one group file or you want individual files. The difference is that in the former case, you pass in all the files as input to PrintReads in one go, while if you want separate files, you run PrintReads separately on each file in turn.

I do not think this is stated in the PrintReads document.

-Sheila

• DEPosts: 23Member

Hi @Sheila‌, yes, the PrintReads step was already in my pipeline - the document is not particularly clear about multi and single sample input vs output, but it seems to be working fine for me. Thanks for the clarification!

Hi,

I am happy it is working for you. Thanks for pointing out that the document is not particularly clear. We will try to clear it up soon.

-Sheila

• Posts: 1Member

Hi,

This is indeed a very useful thread. However, I am still a bit confused regarding what is the best way to proceed in terms of the read group ID assignment for different samples sequenced in the same lane.

I have 6 lanes of data. In each lane, 12 samples were multiplexed and sequenced together. I have therefore 12 different bam files per lane (72 in total).

Initially, I was assigning the same @RG ID for the samples sequenced in the same lane, but after reading this thread I understand that the @RG ID must be unique and therefore all samples should have different @RG ID. Is this correct?

However, if all samples have different @RG ID now, how will BSQR now which ones came from the same lane to use the same error model? (this goes on the following of @Carlos question).

Many thanks for the help!

• Posts: 33Member

Hi @jalves‌,

As far as I know BSQR doesn't have a way to tell which samples came from the same lane in a multiplexing run. We have commercial support for GATK from Appistry and that was their answer to this question. I also asked here, at biostars and directly to Appistry about why GATK doesn't use PU to identify samples from the same lane. No answer so far as why that isn't the case.

--Carlos

• Posts: 33Member

Quick note, looking through my communications with Appistry I did get an answer about why not use PU. At the moment this tag is not required. I still think GATK should use the information from the tag if if present.

Hi Carlos et al.,

To be honest, this is a very low priority point for us because our in-house sequencing platform has an established system for processing and managing samples identifiers that works with how BQSR is set up. So even though PU could potentially be useful/relevant (I have not thought this all the way through myself), making the necessary changes to use it if present could lead to complications that we don't have the resources to tackle at this time.

We're also hesitant to advise people on how to proceed exactly with read group assignment because it depends so much on how the sequencing experiment is configured (are samples multiplexed on same lanes and/or spread across lanes, how do you want to split up / merge data etc, what technologies are involved). So we prefer to communicate clearly (or so we hope) on what basis BQSR treats a set of data as a unit (the RG ID), and the idea that recalibration should be done per unit of data that is subject to the same machine biases. With that, people should be able to decide how to process their data appropriately.

Geraldine Van der Auwera, PhD

• Posts: 33Member

Hi @Geraldine_VdAuwera‌ ,

That's fair. My follow up question would be if is possible for you to share how Broad manage samples identifiers in a same lane multiplexed experiment. I don't see a way to do it and at the same time make sure recalibration is done per unit of data subjected to the same machine biases.

Thanks, Carlos.

• Posts: 11Member

Hi

I have 4 recalibrated bam files produced from realined bam files (Note:I have lost my realined.bam files)

Right now I need to split those by chromosomes

I have tried the following code

java -Xmx4g -jar PATH/GenomeAnalysisTK.jar -T PrintReads -R PATH/reference -I rcal.bam01 -I rcal.bam02 -I rcal.bam03 -I rcal.bam04 -L chr25 -o PATH/chr25.bam -nct 8 -S LENIENT

It seems like working for a long time but no file PATH/chr25.bam is generating in the output directory

Any help?

Ashutosh

Hi Ashutosh,

Why are you using -S LENIENT? This could be covering up something that is wrong with your file. Try running without it, and see if that works.

-Sheila