The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.4 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

(howto) Revert a BAM file to FastQ format

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited June 2016 in Tutorials

NOTE: This tutorial has been replaced by a more recent and much improved version, Tutorial#6484.


Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.


  • Installed HTSlib


  1. Shuffle the reads in the bam file
  2. Revert the BAM file to FastQ format
  3. Compress the FastQ file
  4. Note for advanced users

1. Shuffle the reads in the bam file


Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:

htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam 

Expected Result

This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.

The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.

2. Revert the BAM file to FastQ


Revert the BAM file to FastQ format by running the following HTSlib command:

htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq 

Expected Result

This creates an interleaved FastQ file called interleaved_reads.fq containing the now-unmapped paired reads.

Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.

3. Compress the FastQ file


Compress the FastQ file to reduce its size using the gzip utility:

gzip interleaved_reads.fq

Expected Result

This creates a gzipped FastQ file called interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.

BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.

4. Note for advanced users

If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz 
Post edited by shlee on

Issue · Github
by Geraldine_VdAuwera

Issue Number
Last Updated
Closed By


  • Hi Geraldine,

    I found the bamshuf command and had a question about this and the overall philosophy of doing so... In the past I have query-sorted bams prior to make FASTQs from bam files (illumina-produced BAMs are notorious for failing SAM spec).

    The idea was query sorting should prevent the pitfalls of storing information large rearrangements. So, is htscmd bamshuf doing sort-of the same thing or is there a reason to shuffle vs query sort?

    Thanks as always!


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Jason,

    I think it's similar; the idea of using bamshuf is to randomize the order of reads to avoid biases in the mapping process that come from coordinate-sorted data.

  • splaisansplaisan Leuven / Gent (Belgium)Member

    Thanks for this page Geraldine, I never heard of this bias before and do not find it mentioned elsewhere. Is there a consensus reference text about this issue?

    I know that the remaining part is not a direct GATK matter but hopefully it will help others adventurous.

    After some level of trial and ERRORs, I found that BWA mem can handle interleaved fastQ provided one adds -p to the bwa mem command. By contrast, I did not find a way to feed BWA aln with interleaved fastQ data (is there a way?).

    I then discovered that BWA aln can take bam as input, we can therefore feed it with the shuffled bam instead of going the long way of de-interleaving the FastQ?

    # first shuffle your bam file to 'shuffled_reads.bam' as above ..
    # second remap using bwa aln and 8 threads
    # bwa aln man extract
    #        -b        the input read file is in the BAM format
    #         ...
    #        -1        use the 1st read in a pair (effective with -b)
    #        -2        use the 2nd read in a pair (effective with -b)
    bwa aln -t 8 ref.fa -b1 shuffled_reads.bam > 1.sai
    bwa aln -t 8 ref.fa -b2 shuffled_reads.bam > 2.sai
    # provide twice the bam file, once for each end
    bwa sampe ref.fa 1.sai 2.sai shuffled_reads.bam shuffled_reads.bam | \
           samtools view -Su - | \
           samtools sort - aln_remapped
    # rm 1.sai 2.sai
    # for maadventurous


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I'm not aware of any reference text on this, it is a part of the shared knowledge in our group and other collaborators.

    Thanks for your comments on the bwa options and the script, I'm sure they will be useful for others.

  • Hello,
    could you please clarify if it is the same to feed BWA with the shuffled bam file or with the interleaved fast?
    It is not needed to revert to a fast to lose previous alignment info? Can I skip that part and still have the same result?
    Thanks a lot,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Ester,

    I believe that both approaches will give the same results. But if you want to be absolutely sure, check the BWA documentation -- since it is not our software, we can't give you any guarantees.

  • adaywilladaywill BostonMember

    I have a question about best practices when reverting BAMs to fastq. If I'm going to completely re-analyze (mapping all the way to calling variants) the reverted fastq and there are multiple read-groups in the BAM, is it best to create a fastq for each read group?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Yes, you should separate the data by read group into individual FASTQ files. As I recall, the reverting process will discard read group information, so you'll need to add it back in afterward before merging the data back together. If you don't separate them out you'll lose the ability to distinguish between read groups, which would be bad.

  • adaywilladaywill BostonMember

    @Geraldine_VdAuwera Thanks. That is what I thought and what I was doing just wanted a sanity check.

  • Hi Geraldine, to realign a processed bam file, I used to do:
    1. RevertSam, 2.MArkIlluminaAdapters, 3.SamToFastq, 4.bwa aln, 5.bwa sampe, 6.MergeBamAlignment
    Should I incorporate now as a first step the 'Shuffled bam files' step?
    Can I continue with my pipeline steps instead of 'Revert the BAM file to FastQ' described here?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @ecuenca,

    The bam file shuffling is meant to eliminate biases (due to the bam being sorted) that affect the mapping step, so yes, I would recommend adding that step to your pipeline.

  • TrakhtenbergTrakhtenberg Boston, MA. USAMember
    edited September 2014

    Is there a parameter which could instruct the command to output two fastQ files, read1 and read2 (for the pairend BAM), instead of having them merged? Also, so that they would still be paired and ready for tophat.

    I also have found on forms versions of this command (below), and wonder if they may have some advantages:

    htscmd bamshuf -Ou input.bam tmp-prefix | htscmd bam2fq -s se.fq.gz - | gzip > pe.fq.gz

    htscmd bamshuf -uO in.bam tmp-prefix | htscmd bam2fq -as se.fq.gz - | bwa mem -p ref.fa -

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    As that is not our tool, we don't provide usage recommendations beyond the specific use case shown here. For other possibilities, please see the tool's documentation or ask your question in a more general forum such as SeqAnswers.

  • @Trakhtenberg,

    If you would like an alternate way to convert bam to fastq, resort them based on query name and have a fastq file outputs for read 1 and read 2. You can check out RevertSam.jar and SamToFastq.jar from the Picard's suite of tools. Because I am typically converting bam files that have recalibrated base call Q scores, I use RevertSam first and pipe it to SamToFastq. If your bam files don't have that (i.e. if there is no OQ tag in your bam files) then you can just use SamToFastq.

    Below is the basic cmd line that I (almost) always use, to give you an idea; you can look at the other options to tailor them based on what you have and what you want in the end.


  • Try bedtools bamtofastq --

    Better yet, do it all on a pipe and send to bwa mem like you posted. Great example here in the bcbio code:

  • vasilisvasilis Aberystwyth universityMember

    I'm trying to do the shuffle step but I cannot find the htscmd bamshuf command.
    Could you help me with that or to recommend me another software that I can use for this step?
    I know that its not in the GATK package but I will appreciate your help a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
  • vasilisvasilis Aberystwyth universityMember

    Thank you very much, I found it!

  • vasilisvasilis Aberystwyth universityMember

    Another thing that I noticed is that the -a option doesn't work for bam2fq tool

  • vasilisvasilis Aberystwyth universityMember

    I'm sorry for my naive question,
    but when I am using the shuffle step the shuffled bam file is quite smaller and finally my fq files are smaller, too.
    Is there any specific reason for that?
    Thank you very much in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator



    These processing steps discard information, so it is normal for the resulting files to be smaller since they contain less information.


  • Could we get a corrected example of proper usage since the htscmd command has been changed? I've been looking for documentation in both the HTSlib and SamTools, and have come up empty. Any assistance would be greatly appreciated.

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @cpatterson,

    We have a new tutorial, (How to) Generate an unmapped BAM from FASTQ or aligned BAM. Is this the type of documentation you are looking for?

  • Yes. That is what I needed. Many thanks.

Sign In or Register to comment.