Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

(howto) Revert a BAM file to FastQ format

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin
edited July 2013 in Tutorials

Objective

Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.

Prerequisites

  • Installed HTSlib

Steps

  1. Shuffle the reads in the bam file
  2. Revert the BAM file to FastQ format
  3. Compress the FastQ file
  4. Note for advanced users



1. Shuffle the reads in the bam file

Action

Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:

htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam 

Expected Result

This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.

The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.




2. Revert the BAM file to FastQ

Action

Revert the BAM file to FastQ format by running the following HTSlib command:

htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq 

Expected Result

This creates an interleaved FastQ file called interleaved_reads.fq containing the now-unmapped paired reads.

Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.




3. Compress the FastQ file

Action

Compress the FastQ file to reduce its size using the gzip utility:

gzip interleaved_reads.fq

Expected Result

This creates a gzipped FastQ file called interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.

BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.




4. Note for advanced users

If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz 
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • caddymobcaddymob Posts: 9Member

    Hi Geraldine,

    I found the bamshuf command and had a question about this and the overall philosophy of doing so... In the past I have query-sorted bams prior to make FASTQs from bam files (illumina-produced BAMs are notorious for failing SAM spec).

    The idea was query sorting should prevent the pitfalls of storing information large rearrangements. So, is htscmd bamshuf doing sort-of the same thing or is there a reason to shuffle vs query sort?

    Thanks as always!

    Jason

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    Hi Jason,

    I think it's similar; the idea of using bamshuf is to randomize the order of reads to avoid biases in the mapping process that come from coordinate-sorted data.

    Geraldine Van der Auwera, PhD

  • splaisansplaisan Posts: 14Member

    Thanks for this page Geraldine, I never heard of this bias before and do not find it mentioned elsewhere. Is there a consensus reference text about this issue?

    I know that the remaining part is not a direct GATK matter but hopefully it will help others adventurous.

    After some level of trial and ERRORs, I found that BWA mem can handle interleaved fastQ provided one adds -p to the bwa mem command. By contrast, I did not find a way to feed BWA aln with interleaved fastQ data (is there a way?).

    I then discovered that BWA aln can take bam as input, we can therefore feed it with the shuffled bam instead of going the long way of de-interleaving the FastQ?

    # first shuffle your bam file to 'shuffled_reads.bam' as above ..

    # second remap using bwa aln and 8 threads
    # bwa aln man extract
    # -b the input read file is in the BAM format
    # ...
    # -1 use the 1st read in a pair (effective with -b)
    # -2 use the 2nd read in a pair (effective with -b)

    bwa aln -t 8 ref.fa -b1 shuffled_reads.bam > 1.sai
    bwa aln -t 8 ref.fa -b2 shuffled_reads.bam > 2.sai

    # provide twice the bam file, once for each end
    bwa sampe ref.fa 1.sai 2.sai shuffled_reads.bam shuffled_reads.bam | \
    samtools view -Su - | \
    samtools sort - aln_remapped
    # rm 1.sai 2.sai

    # for maadventurous

    Stephane

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    I'm not aware of any reference text on this, it is a part of the shared knowledge in our group and other collaborators.

    Thanks for your comments on the bwa options and the script, I'm sure they will be useful for others.

    Geraldine Van der Auwera, PhD

  • ecuencaecuenca Posts: 24Member

    Hello,
    could you please clarify if it is the same to feed BWA with the shuffled bam file or with the interleaved fast?
    It is not needed to revert to a fast to lose previous alignment info? Can I skip that part and still have the same result?
    Thanks a lot,
    Ester

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    Hi Ester,

    I believe that both approaches will give the same results. But if you want to be absolutely sure, check the BWA documentation -- since it is not our software, we can't give you any guarantees.

    Geraldine Van der Auwera, PhD

  • adaywilladaywill BostonPosts: 4Member

    I have a question about best practices when reverting BAMs to fastq. If I'm going to completely re-analyze (mapping all the way to calling variants) the reverted fastq and there are multiple read-groups in the BAM, is it best to create a fastq for each read group?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    Yes, you should separate the data by read group into individual FASTQ files. As I recall, the reverting process will discard read group information, so you'll need to add it back in afterward before merging the data back together. If you don't separate them out you'll lose the ability to distinguish between read groups, which would be bad.

    Geraldine Van der Auwera, PhD

  • adaywilladaywill BostonPosts: 4Member

    @Geraldine_VdAuwera Thanks. That is what I thought and what I was doing just wanted a sanity check.

  • ecuencaecuenca Posts: 24Member

    Hi Geraldine, to realign a processed bam file, I used to do:
    1. RevertSam, 2.MArkIlluminaAdapters, 3.SamToFastq, 4.bwa aln, 5.bwa sampe, 6.MergeBamAlignment
    Should I incorporate now as a first step the 'Shuffled bam files' step?
    Can I continue with my pipeline steps instead of 'Revert the BAM file to FastQ' described here?
    Thanks,
    Ester

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    Hi @ecuenca,

    The bam file shuffling is meant to eliminate biases (due to the bam being sorted) that affect the mapping step, so yes, I would recommend adding that step to your pipeline.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.