(howto) Revert a BAM file to FastQ format
NOTE: This tutorial has been replaced by a more recent and much improved version, Tutorial#6484.
Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.
- Installed HTSlib
- Shuffle the reads in the bam file
- Revert the BAM file to FastQ format
- Compress the FastQ file
- Note for advanced users
1. Shuffle the reads in the bam file
Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:
htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam
This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.
The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.
2. Revert the BAM file to FastQ
Revert the BAM file to FastQ format by running the following HTSlib command:
htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq
This creates an interleaved FastQ file called
interleaved_reads.fq containing the now-unmapped paired reads.
Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.
3. Compress the FastQ file
Compress the FastQ file to reduce its size using the gzip utility:
This creates a gzipped FastQ file called
interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.
BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.
4. Note for advanced users
If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):
htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz