(howto) Revert a BAM file to FastQ format

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin
edited July 2013 in Tutorials

Objective

Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.

Prerequisites

  • Installed HTSlib

Steps

  1. Shuffle the reads in the bam file
  2. Revert the BAM file to FastQ format
  3. Compress the FastQ file
  4. Note for advanced users

1. Shuffle the reads in the bam file

Action

Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:

htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam 

Expected Result

This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.

The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.


2. Revert the BAM file to FastQ

Action

Revert the BAM file to FastQ format by running the following HTSlib command:

htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq 

Expected Result

This creates an interleaved FastQ file called interleaved_reads.fq containing the now-unmapped paired reads.

Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.


3. Compress the FastQ file

Action

Compress the FastQ file to reduce its size using the gzip utility:

gzip interleaved_reads.fq

Expected Result

This creates a gzipped FastQ file called interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.

BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.


4. Note for advanced users

If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz 
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Issue · Github
by Geraldine_VdAuwera

Issue Number
41
State
open
Last Updated

Comments

  • caddymobcaddymob Posts: 10Member

    Hi Geraldine,

    I found the bamshuf command and had a question about this and the overall philosophy of doing so... In the past I have query-sorted bams prior to make FASTQs from bam files (illumina-produced BAMs are notorious for failing SAM spec).

    The idea was query sorting should prevent the pitfalls of storing information large rearrangements. So, is htscmd bamshuf doing sort-of the same thing or is there a reason to shuffle vs query sort?

    Thanks as always!

    Jason

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi Jason,

    I think it's similar; the idea of using bamshuf is to randomize the order of reads to avoid biases in the mapping process that come from coordinate-sorted data.

    Geraldine Van der Auwera, PhD

  • splaisansplaisan Posts: 15Member

    Thanks for this page Geraldine, I never heard of this bias before and do not find it mentioned elsewhere. Is there a consensus reference text about this issue?

    I know that the remaining part is not a direct GATK matter but hopefully it will help others adventurous.

    After some level of trial and ERRORs, I found that BWA mem can handle interleaved fastQ provided one adds -p to the bwa mem command. By contrast, I did not find a way to feed BWA aln with interleaved fastQ data (is there a way?).

    I then discovered that BWA aln can take bam as input, we can therefore feed it with the shuffled bam instead of going the long way of de-interleaving the FastQ?

    # first shuffle your bam file to 'shuffled_reads.bam' as above ..
    
    # second remap using bwa aln and 8 threads
    # bwa aln man extract
    #        -b        the input read file is in the BAM format
    #         ...
    #        -1        use the 1st read in a pair (effective with -b)
    #        -2        use the 2nd read in a pair (effective with -b)
    
    bwa aln -t 8 ref.fa -b1 shuffled_reads.bam > 1.sai
    bwa aln -t 8 ref.fa -b2 shuffled_reads.bam > 2.sai
    
    # provide twice the bam file, once for each end
    bwa sampe ref.fa 1.sai 2.sai shuffled_reads.bam shuffled_reads.bam | \
           samtools view -Su - | \
           samtools sort - aln_remapped
    # rm 1.sai 2.sai
    
    # for maadventurous
    

    Stephane

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    I'm not aware of any reference text on this, it is a part of the shared knowledge in our group and other collaborators.

    Thanks for your comments on the bwa options and the script, I'm sure they will be useful for others.

    Geraldine Van der Auwera, PhD

  • ecuencaecuenca Posts: 24Member

    Hello,
    could you please clarify if it is the same to feed BWA with the shuffled bam file or with the interleaved fast?
    It is not needed to revert to a fast to lose previous alignment info? Can I skip that part and still have the same result?
    Thanks a lot,
    Ester

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi Ester,

    I believe that both approaches will give the same results. But if you want to be absolutely sure, check the BWA documentation -- since it is not our software, we can't give you any guarantees.

    Geraldine Van der Auwera, PhD

  • adaywilladaywill BostonPosts: 7Member

    I have a question about best practices when reverting BAMs to fastq. If I'm going to completely re-analyze (mapping all the way to calling variants) the reverted fastq and there are multiple read-groups in the BAM, is it best to create a fastq for each read group?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Yes, you should separate the data by read group into individual FASTQ files. As I recall, the reverting process will discard read group information, so you'll need to add it back in afterward before merging the data back together. If you don't separate them out you'll lose the ability to distinguish between read groups, which would be bad.

    Geraldine Van der Auwera, PhD

  • adaywilladaywill BostonPosts: 7Member

    @Geraldine_VdAuwera Thanks. That is what I thought and what I was doing just wanted a sanity check.

  • ecuencaecuenca Posts: 24Member

    Hi Geraldine, to realign a processed bam file, I used to do:
    1. RevertSam, 2.MArkIlluminaAdapters, 3.SamToFastq, 4.bwa aln, 5.bwa sampe, 6.MergeBamAlignment
    Should I incorporate now as a first step the 'Shuffled bam files' step?
    Can I continue with my pipeline steps instead of 'Revert the BAM file to FastQ' described here?
    Thanks,
    Ester

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi @ecuenca,

    The bam file shuffling is meant to eliminate biases (due to the bam being sorted) that affect the mapping step, so yes, I would recommend adding that step to your pipeline.

    Geraldine Van der Auwera, PhD

  • TrakhtenbergTrakhtenberg Boston, MA. USAPosts: 1Member
    edited September 2014

    Is there a parameter which could instruct the command to output two fastQ files, read1 and read2 (for the pairend BAM), instead of having them merged? Also, so that they would still be paired and ready for tophat.

    I also have found on forms versions of this command (below), and wonder if they may have some advantages:

    htscmd bamshuf -Ou input.bam tmp-prefix | htscmd bam2fq -s se.fq.gz - | gzip > pe.fq.gz

    htscmd bamshuf -uO in.bam tmp-prefix | htscmd bam2fq -as se.fq.gz - | bwa mem -p ref.fa -

    Post edited by Trakhtenberg on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    As that is not our tool, we don't provide usage recommendations beyond the specific use case shown here. For other possibilities, please see the tool's documentation or ask your question in a more general forum such as SeqAnswers.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 222Member ✭✭✭

    @Trakhtenberg,

    If you would like an alternate way to convert bam to fastq, resort them based on query name and have a fastq file outputs for read 1 and read 2. You can check out RevertSam.jar and SamToFastq.jar from the Picard's suite of tools. Because I am typically converting bam files that have recalibrated base call Q scores, I use RevertSam first and pipe it to SamToFastq. If your bam files don't have that (i.e. if there is no OQ tag in your bam files) then you can just use SamToFastq.

    http://picard.sourceforge.net/index.shtml

    Below is the basic cmd line that I (almost) always use, to give you an idea; you can look at the other options to tailor them based on what you have and what you want in the end.

    java -jar $PICARD_DIR/RevertSam.jar \
    INPUT=$IN_BAM \
    OUTPUT=/dev/stdout \
    SORT_ORDER=queryname \
    COMPRESSION_LEVEL=0 \
    VALIDATION_STRINGENCY=SILENT | java -jar $PICARD_DIR/SamToFastq.jar \
    INPUT=/dev/stdin \
    OUTPUT_PER_RG=true \
    OUTPUT_DIR=$CORE_PATH/FASTQ/ \
    VALIDATION_STRINGENCY=SILENT
  • caddymobcaddymob Posts: 10Member

    Try bedtools bamtofastq -- https://code.google.com/p/bedtools/

    Better yet, do it all on a pipe and send to bwa mem like you posted. Great example here in the bcbio code:

    https://github.com/chapmanb/bcbio-nextgen/blob/01a6d99c7a8bb7a73ee35313c8af4c6b4d8c66fe/bcbio/ngsalign/bwa.py#L41-L43

  • vasilisvasilis Aberystwyth universityPosts: 6Member

    Hello,
    I'm trying to do the shuffle step but I cannot find the htscmd bamshuf command.
    Could you help me with that or to recommend me another software that I can use for this step?
    I know that its not in the GATK package but I will appreciate your help a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    @vasilis, the usage / command name has changed. See https://github.com/samtools/htslib/issues/32

    Geraldine Van der Auwera, PhD

  • vasilisvasilis Aberystwyth universityPosts: 6Member

    Thank you very much, I found it!

  • vasilisvasilis Aberystwyth universityPosts: 6Member

    Another thing that I noticed is that the -a option doesn't work for bam2fq tool

  • vasilisvasilis Aberystwyth universityPosts: 6Member

    I'm sorry for my naive question,
    but when I am using the shuffle step the shuffled bam file is quite smaller and finally my fq files are smaller, too.
    Is there any specific reason for that?
    Thank you very much in advance.

  • SheilaSheila Broad InstitutePosts: 1,928Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @vasilis

    Hi,

    These processing steps discard information, so it is normal for the resulting files to be smaller since they contain less information.

    -Sheila

Sign In or Register to comment.