(howto) Fix a badly formatted BAM

Geraldine_VdAuwera (Administrator, GATK Developer)
edited July 2013 in Tutorials

Objective

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other, but the order shown below is recommended.

Prerequisites

  • Installed Picard tools

Steps

  1. Sort the aligned reads by coordinate order
  2. Mark duplicate reads
  3. Add read group information
  4. Index the BAM file

Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.


1. Sort the aligned reads by coordinate order

Action

Run the following Picard command:

java -jar SortSam.jar \
    INPUT=unsorted_reads.bam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate

Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.
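
One way to confirm the result is to read the sort order recorded in the BAM header. The following is a minimal sketch, assuming samtools is available to print the header (it is not among this tutorial's prerequisites); the helper name sort_order_of_header is hypothetical. It extracts the SO field of the @HD header line and prints "unknown" when that field is absent, which is the default defined by the SAM specification.

```shell
# Hypothetical helper: reads SAM header text on stdin and prints the
# sort order stored in the SO field of the @HD line.
sort_order_of_header() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
        }
        # No @HD line or no SO field: the SAM default is "unknown".
        END { if (!found) print "unknown" }
    '
}

# Example usage (assumes samtools is installed):
#   samtools view -H sorted_reads.bam | sort_order_of_header
```

After a successful SortSam run with SORT_ORDER=coordinate, the helper should print "coordinate".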


2. Mark duplicate reads

Action

Run the following Picard command:

java -jar MarkDuplicates.jar \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt

Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also writes duplication metrics to a file called metrics.txt.

More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.
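
Marked duplicates are identified by bit 0x400 (decimal 1024) of the SAM FLAG field, so the effect of this step can be inspected by counting reads with that bit set. The following is a minimal sketch, assuming samtools is available to convert the BAM to SAM text; the helper name count_duplicates is hypothetical.

```shell
# Hypothetical helper: reads SAM text on stdin and prints
# "<duplicate reads><TAB><total reads>".
count_duplicates() {
    awk -F'\t' '
        /^@/ { next }                 # skip header lines
        {
            total++
            # Column 2 is FLAG; test bit 0x400 using integer arithmetic.
            if (int($2 / 1024) % 2 == 1) dup++
        }
        END { printf "%d\t%d\n", dup, total }
    '
}

# Example usage (assumes samtools is installed):
#   samtools view -h dedup_reads.bam | count_duplicates
```

The reads themselves remain in the file; only the flag changes, which is why the output has the same content as the input.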


3. Add read group information

Action

Run the following Picard command:

java -jar AddOrReplaceReadGroups.jar \
    INPUT=dedup_reads.bam \
    OUTPUT=addrg_reads.bam \
    RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1

Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.
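
The read group values end up in an @RG line of the BAM header, where the RGID, RGLB, RGPL, RGPU and RGSM arguments correspond to the ID, LB, PL, PU and SM tags. The following is a minimal sketch of a header check, assuming samtools is available to print the header; the helper name missing_rg_fields is hypothetical.

```shell
# Hypothetical helper: reads SAM header text on stdin and prints any of the
# ID, LB, PL, PU, SM tags missing from each @RG line (one tag per line).
# Prints nothing when every @RG line carries all five tags.
missing_rg_fields() {
    awk -F'\t' '
        $1 == "@RG" {
            seen = 1
            split("ID LB PL PU SM", want, " ")
            for (w = 1; w <= 5; w++) {
                has = 0
                for (i = 2; i <= NF; i++)
                    if (index($i, want[w] ":") == 1) has = 1
                if (!has) print want[w]
            }
        }
        END { if (!seen) print "no @RG line" }
    '
}

# Example usage (assumes samtools is installed):
#   samtools view -H addrg_reads.bam | missing_rg_fields
```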


4. Index the BAM file

Action

Run the following Picard command:

java -jar BuildBamIndex.jar \
    INPUT=addrg_reads.bam

Expected Results

This creates an index file called addrg_reads.bai. The BAM file, together with its index, is now ready to be used in the Best Practices workflow.
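
A script can guard later steps by checking that the index exists next to the BAM. The following is a minimal sketch that follows the naming described here (the .bam extension replaced by .bai); the helper names index_path_for and has_index are hypothetical.

```shell
# Hypothetical helper: derive the index name by swapping .bam for .bai.
index_path_for() {
    printf '%s\n' "${1%.bam}.bai"
}

# Hypothetical helper: succeed only if that index file exists.
has_index() {
    [ -f "$(index_path_for "$1")" ]
}

# Example usage:
#   has_index addrg_reads.bam || echo "index missing: run BuildBamIndex" >&2
```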

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.
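
The four steps can also be chained in the recommended order. The following is a minimal sketch rather than an official script: PICARD_DIR, DRY_RUN, run and fix_bam are hypothetical names, and METRICS_FILE=metrics.txt is included on the assumption that your Picard version requires a metrics output for MarkDuplicates. Each step runs only if the previous one succeeded.

```shell
# Hypothetical settings: PICARD_DIR is the folder holding the Picard jars.
PICARD_DIR="${PICARD_DIR:-.}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sort, mark duplicates, add read groups, then index (indexing last).
fix_bam() {
    input_bam="$1"
    run java -jar "$PICARD_DIR/SortSam.jar" \
        INPUT="$input_bam" OUTPUT=sorted_reads.bam SORT_ORDER=coordinate &&
    run java -jar "$PICARD_DIR/MarkDuplicates.jar" \
        INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam \
        METRICS_FILE=metrics.txt &&
    run java -jar "$PICARD_DIR/AddOrReplaceReadGroups.jar" \
        INPUT=dedup_reads.bam OUTPUT=addrg_reads.bam \
        RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1 &&
    run java -jar "$PICARD_DIR/BuildBamIndex.jar" INPUT=addrg_reads.bam
}

# Example usage:
#   DRY_RUN=1 fix_bam unsorted_reads.bam   # show the commands only
#   fix_bam unsorted_reads.bam             # run them
```

The read group values are the placeholder values from step 3 and should be replaced with values that describe your own data.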

Geraldine Van der Auwera, PhD
