The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# (howto) Fix a badly formatted BAM

edited January 26

#### Objective

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but this order is recommended.

#### Prerequisites

• Installed Picard tools

#### Steps

1. Sort the aligned reads by coordinate order
2. Mark duplicates
4. Index the BAM file

#### Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

### 1. Sort the aligned reads by coordinate order

#### Action

Run the following Picard command:

java -jar SortSam.jar \
SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

#### Action

Run the following Picard command:

java -jar MarkDuplicates.jar \
METRICS_FILE=metrics.txt


#### Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

#### More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

#### Action

Run the following Picard command:

java -jar AddOrReplaceReadGroups.jar  \
RGID=group1 RGLB= lib1 RGPL=illumina RGPU=unit1 RGSM=sample1


#### Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.

### 4. Index the BAM file

#### Action

Run the following Picard command:

java -jar BuildBamIndex.jar \


#### Expected Results

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• ChinaPosts: 18Member

If I want to add more header information for example, HD, VN, GO,SO, SQ,SN,LN,RG,ID,LB,SM,CL how to do that

To add or modify read group information, use Picard's AddOrReplaceReadGroup. For other header information you'll have to look at other tools in Picard or samtools. GATK does not provide functions to do this.

Geraldine Van der Auwera, PhD

• Redwood CityPosts: 3Member

This was very helpful to me, but I needed to make a few changes to get the examples to work:

In 2) the required input METRICS_FILE should be added, like this:
java -jar MarkDuplicates.jar \
Note for others: if you are using a very recent version of Picard, you may need to adapt the command lines to the new Picard syntax, which uses one monolithic jar to which you pass tool names (e.g. -jar picard.jar BuildBamIndex etc.), like we have in GATK.