Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Collected FAQs about BAM files

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,210Administrator, GSA Member admin
edited March 2013 in FAQs

1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard validation.

See the BAM specification for more information.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

Tag Importance SAM spec definition Meaning
ID Required Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read groupIDs may be modified when merging SAM files in order to handle collisions. Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model.
SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper.
PL Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data . It's a good idea to use this field.
LB DNA preparation library identify Essential for MarkDuplicates MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

We do not require value for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

9. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.

10. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.

11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • bruce01bruce01 Posts: 8Member
    edited September 2012

    In the example of LB and PI (section 9) should the PI not be PI:200, PI:400, PI:200, PI:400 vs 200,200,400,400? LB flags the library, and you have the 2 libraries sequenced twice, once at 200bp inserts, once at 400bp. Surely this example defines it as being sequenced twice at 200 and 400bp respectively?

    Post edited by bruce01 on
  • CarneiroCarneiro Posts: 270Administrator, GSA Member admin

    We do not require value for the CN, DS, DT, PG, PI, or PU fields

  • vsvintivsvinti Posts: 44Member

    Hi there,

    Does base recalibration look at the RG ID field to determine sequencing lanes, or does it look at the fastq headers for each read? How important is the format of the ID field for GATK (as long as it's unique per bam file)? It it doesn't contain all of the fastq header information (flowcell, lane etc), does that affect the performance of the gatk algorithms?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,210Administrator, GSA Member admin

    It is important to put as much information as possible in the RG field, because that is indeed what the tool uses to determine covariates for the analysis. The more information there is in the RG, the more accurate the analysis, for base recalibration at least. Other tools are not so reliant on that information, however.

    Geraldine Van der Auwera, PhD

  • trickytanktrickytank The Walter and Eliza Hall Institute of Medical Research. Melbourne, AustraliaPosts: 3Member

    In regard to the section: "What's the meaning of the standard read group fields?" We often have multiplexed samples with the same flowcell and lane number. Should different samples (with different SM values) have the same the same read group IDs (in different BAM files), or should we include the barcode in that too? We wonder if GATK quality score recalibration might somehow use the information that the samples were sequenced in the same flowcell and lane. Is there anything wrong with setting multiple multiplexed samples with the same library? Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,210Administrator, GSA Member admin

    Hi @trickytank,

    I'm not sure I understand exactly what is your question, but I can tell you that Base Recalibration should be done purely per lane. The tool is read group aware so you can have information originating from several lanes for each sample in a single bam file; each lane's worth will be recalibrated separately as long as they are distinguished by different read groups.

    Does that clarify things?

    Geraldine Van der Auwera, PhD

  • mstagliamontemstagliamonte USAPosts: 12Member
    edited December 2013

    Hi, Geraldine, Is it correct to say that different samples, sequenced together on the same lane, will have exactly the same read group (RG), and be told apart from the sample name (SM)?

    ...I promised in another topic that was my last question of the year... I guess I am not good at keeping promises...

    Max

    Post edited by mstagliamonte on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,210Administrator, GSA Member admin
    edited March 17

    Tssk tssk... That's okay, Max, you can ask questions until tomorrow at 4 pm EST. After that, pfft, we're gone :)

    The read group string overall contains ID, lane info, sample info etc, so strictly speaking they're not the same. But yes, it's okay for different samples to have the same read group ID. BQSR will correctly process data by ID, and the rest of the GATK tools will process data by sample name (SM).

    EDIT: this was a bad answer (sorry!) and has been corrected below.

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • furgason5furgason5 Posts: 10Member

    This is a very useful thread. However, the information about how to correct the contigs could use some clarifying. I went through the process of figuring out how to fix my bam files that are in the consortium standard to get them to work with the UCSC hg19 convention on my own a year or so ago. But now I can't remember how I got it. SortSam and ReorderSam do not seem to be doing it for me.

  • pdexheimerpdexheimer Posts: 292Member ✭✭✭

    @furgason5, that's because the advice is realign them, not to correct the names. There are slight differences between the assemblies. To my knowledge, there are none in the autosomes or sex chromosomes, but I know the mito reference differs and would expect the unplaced chroms to differ as well.

    Actually, the correct answer that avoids realignment is to use liftover. I know that there's a LiftoverVariants walker that works on VCFs, I don't know of anything for bams.

    After all the warnings, if you still want to just modify the names in the bam, it's reasonably difficult. I don't think there are any tools that will do this for you, but what you would do is to write it out as a sam, filter each read to update the contig names (in both fields - one for the current alignment, one for the mate), update the @SQ dictionary in the header, and then convert back to bam. And if you care about any of the optional read tags, you'll need to update those as well (I think BWA puts alternate alignments in one of those tags occasionally).

    On balance, I usually find it easy to just realign

  • jfarrelljfarrell Posts: 28Member

    If a RG id is unique within a bam file but not unique across bam files, it sounds like the RG id will need to be updated to a unique id across all bam files to be analyzed. It that the case? We have bam files from a sequencing center that has simply sequentially assigned RG ID to 0, 0.1 or 0.2 within each bam file. What is the best approach to deal with this? Is there a way to automatically generate unique RG ids across these bam files or is an update of the RG id required in each bam file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,210Administrator, GSA Member admin

    Hi @jfarrell,

    Different tools use different RG tags to aggregate data, so for those that don't use the ID (but instead use SM, for example) you may get away with non-unique RG IDs. HOWEVER this is generally a bad idea and may cause you terrible headaches down the road, so I would strongly suggest using unique IDs. This should be a data management policy decision. You can update the RGs as you get the data from the sequencing center (or better yet, yell at them and demand that they assign sensible RGs); see the recommendations for generating unique IDs given in the table above.

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 26Member

    Hi @Geraldine_VdAuwera‌ ,

    In your answers above to @trickytank‌ and @mstagliamonte‌ , you mentioned in the case of multiplexing several samples in one lane, all read groups should have the same @RG ID. I thought this wasn't allowed. Let me show an example.

    I have two samples and I'm sequencing them in the same lane(flowcell=A, lane=1). If I understood your answers, you are saying my read groups should look like:

    @RG ID:A.1 SM:SAMPLE01 LB:SAMPLE01
    @RG ID:A.1 SM:SAMPLE02 LB:SAMPLE02

    I don't think you can have multiple read groups with the same @RG ID. If I understood the SAM spec, I thought I should have something like this:

    @RG ID:SAMPLE01 SM:SAMPLE01 LB:SAMPLE01 PU:A.1
    @RG ID:SAMPLE02 SM:SAMPLE02 LB:SAMPLE02 PU:A.1

    Now I have @RG ID that are unique, but BSQR won't use the same error model for both read groups. I wonder why BSQR doesn't use @RG PU instead? Picard already requires that tag.

    Thanks, Carlos

  • ebanksebanks Posts: 671GSA Member mod

    Correct. It is NEVER okay to have identical read group IDs. That was a mistake on our end - thanks for catching it!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • KurtKurt Posts: 110Member ✭✭✭

    My understanding/experience has been this. PU should always be the most unique identifier. Given that PU is flowcell,lane.barcode for multiplex samples then if using SamToFastq.jar in Picard and specifying OUTPUT_PER_RG, the PU tag takes precedence over the ID tag when outputting the prefix name(s) of the fastq file(s). So if you have multiplex samples in a lane, but have a bam file for each platform unit, then individually they can have the same ID tag. However, if you are to merge them all back into a single bam file then the ID tag needs to be modified to avoid collisions, I believe MergeSamFiles.jar in Picard does this for you (or maybe a walker in GATK or both, I haven't had a need to do that in years). So given that you have two separate bam files like so;

    @RG ID:flowcellA.lane1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX

    and

    @RG ID:flowcellA.lane1 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodeY

    and then merged them together then, the RG lines should be;

    @RG ID:flowcellA.lane1.1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX

    @RG ID:flowcellA.lane1.2 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodyY

Sign In or Register to comment.