The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# (howto) Revert a BAM file to FastQ format

edited July 2013

#### Objective

Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.

#### Prerequisites

• Installed HTSlib

#### Steps

1. Shuffle the reads in the bam file
2. Revert the BAM file to FastQ format
3. Compress the FastQ file

### 1. Shuffle the reads in the bam file

#### Action

Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:

htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam


#### Expected Result

This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.

The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.

### 2. Revert the BAM file to FastQ

#### Action

Revert the BAM file to FastQ format by running the following HTSlib command:

htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq


#### Expected Result

This creates an interleaved FastQ file called interleaved_reads.fq containing the now-unmapped paired reads.

Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.

### 3. Compress the FastQ file

#### Action

Compress the FastQ file to reduce its size using the gzip utility:

gzip interleaved_reads.fq


#### Expected Result

This creates a gzipped FastQ file called interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.

BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.

### 4. Note for advanced users

If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 10Member

Hi Geraldine,

I found the bamshuf command and had a question about this and the overall philosophy of doing so... In the past I have query-sorted bams prior to make FASTQs from bam files (illumina-produced BAMs are notorious for failing SAM spec).

The idea was query sorting should prevent the pitfalls of storing information large rearrangements. So, is htscmd bamshuf doing sort-of the same thing or is there a reason to shuffle vs query sort?

Thanks as always!

Jason

Hi Jason,

I think it's similar; the idea of using bamshuf is to randomize the order of reads to avoid biases in the mapping process that come from coordinate-sorted data.

Geraldine Van der Auwera, PhD

• Posts: 14Member

Thanks for this page Geraldine, I never heard of this bias before and do not find it mentioned elsewhere. Is there a consensus reference text about this issue?

I know that the remaining part is not a direct GATK matter but hopefully it will help others adventurous.

After some level of trial and ERRORs, I found that BWA mem can handle interleaved fastQ provided one adds -p to the bwa mem command. By contrast, I did not find a way to feed BWA aln with interleaved fastQ data (is there a way?).

I then discovered that BWA aln can take bam as input, we can therefore feed it with the shuffled bam instead of going the long way of de-interleaving the FastQ?

# first shuffle your bam file to 'shuffled_reads.bam' as above ..

# second remap using bwa aln and 8 threads
# bwa aln man extract
#        -b        the input read file is in the BAM format
#         ...
#        -1        use the 1st read in a pair (effective with -b)
#        -2        use the 2nd read in a pair (effective with -b)

bwa aln -t 8 ref.fa -b1 shuffled_reads.bam > 1.sai
bwa aln -t 8 ref.fa -b2 shuffled_reads.bam > 2.sai

# provide twice the bam file, once for each end
samtools view -Su - | \
samtools sort - aln_remapped
# rm 1.sai 2.sai



Stephane

I'm not aware of any reference text on this, it is a part of the shared knowledge in our group and other collaborators.

Thanks for your comments on the bwa options and the script, I'm sure they will be useful for others.

Geraldine Van der Auwera, PhD

• Posts: 24Member

Hello, could you please clarify if it is the same to feed BWA with the shuffled bam file or with the interleaved fast? It is not needed to revert to a fast to lose previous alignment info? Can I skip that part and still have the same result? Thanks a lot, Ester

Hi Ester,

I believe that both approaches will give the same results. But if you want to be absolutely sure, check the BWA documentation -- since it is not our software, we can't give you any guarantees.

Geraldine Van der Auwera, PhD

• BostonPosts: 4Member

I have a question about best practices when reverting BAMs to fastq. If I'm going to completely re-analyze (mapping all the way to calling variants) the reverted fastq and there are multiple read-groups in the BAM, is it best to create a fastq for each read group?

Yes, you should separate the data by read group into individual FASTQ files. As I recall, the reverting process will discard read group information, so you'll need to add it back in afterward before merging the data back together. If you don't separate them out you'll lose the ability to distinguish between read groups, which would be bad.

Geraldine Van der Auwera, PhD

• BostonPosts: 4Member

@Geraldine_VdAuwera Thanks. That is what I thought and what I was doing just wanted a sanity check.

• Posts: 24Member

Hi Geraldine, to realign a processed bam file, I used to do: 1. RevertSam, 2.MArkIlluminaAdapters, 3.SamToFastq, 4.bwa aln, 5.bwa sampe, 6.MergeBamAlignment Should I incorporate now as a first step the 'Shuffled bam files' step? Can I continue with my pipeline steps instead of 'Revert the BAM file to FastQ' described here? Thanks, Ester

Hi @ecuenca,

The bam file shuffling is meant to eliminate biases (due to the bam being sorted) that affect the mapping step, so yes, I would recommend adding that step to your pipeline.

Geraldine Van der Auwera, PhD

• Boston, MA. USAPosts: 1Member
edited September 2

Is there a parameter which could instruct the command to output two fastQ files, read1 and read2 (for the pairend BAM), instead of having them merged? Also, so that they would still be paired and ready for tophat.

I also have found on forms versions of this command (below), and wonder if they may have some advantages:

htscmd bamshuf -Ou input.bam tmp-prefix | htscmd bam2fq -s se.fq.gz - | gzip > pe.fq.gz

htscmd bamshuf -uO in.bam tmp-prefix | htscmd bam2fq -as se.fq.gz - | bwa mem -p ref.fa -

Post edited by Trakhtenberg on

As that is not our tool, we don't provide usage recommendations beyond the specific use case shown here. For other possibilities, please see the tool's documentation or ask your question in a more general forum such as SeqAnswers.

Geraldine Van der Auwera, PhD

• Posts: 161Member ✭✭✭

If you would like an alternate way to convert bam to fastq, resort them based on query name and have a fastq file outputs for read 1 and read 2. You can check out RevertSam.jar and SamToFastq.jar from the Picard's suite of tools. Because I am typically converting bam files that have recalibrated base call Q scores, I use RevertSam first and pipe it to SamToFastq. If your bam files don't have that (i.e. if there is no OQ tag in your bam files) then you can just use SamToFastq.

http://picard.sourceforge.net/index.shtml

Below is the basic cmd line that I (almost) always use, to give you an idea; you can look at the other options to tailor them based on what you have and what you want in the end.

java -jar $PICARD_DIR/RevertSam.jar \ INPUT=$IN_BAM \
OUTPUT=/dev/stdout \
SORT_ORDER=queryname \
COMPRESSION_LEVEL=0 \
VALIDATION_STRINGENCY=SILENT | java -jar $PICARD_DIR/SamToFastq.jar \ INPUT=/dev/stdin \ OUTPUT_PER_RG=true \ OUTPUT_DIR=$CORE_PATH/FASTQ/ \
VALIDATION_STRINGENCY=SILENT
• Posts: 10Member

Try bedtools bamtofastq -- https://code.google.com/p/bedtools/

Better yet, do it all on a pipe and send to bwa mem like you posted. Great example here in the bcbio code:

https://github.com/chapmanb/bcbio-nextgen/blob/01a6d99c7a8bb7a73ee35313c8af4c6b4d8c66fe/bcbio/ngsalign/bwa.py#L41-L43