The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at http://bit.ly/2i4mGxz

# (How to) Generate an unmapped BAM from FASTQ or aligned BAM

edited July 2016

Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

(A) Convert FASTQ to uBAM and add read group information using FastqToSam
(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

#### Prerequisites

• Installed Picard tools

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and to 10:91,000,000-92,000,000.

#### Related resources

• Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure your read group fields conform to the recommendations.
• This How to is part of a larger workflow and tutorial on (How to) Efficiently map and clean up short read sequence data.
• To extract reads in a genomic interval from the aligned BAM, use GATK's PrintReads.
• In the future we will post on how to generate a uBAM from BCL data using IlluminaBasecallsToSAM.

### (A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM, requires two read group fields and makes optional specification of other read group fields. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

java -Xmx8G -jar picard.jar FastqToSam \
FASTQ=6484_snippet_1.fastq \ #first read file of pair
FASTQ2=6484_snippet_2.fastq \ #second read file of pair
OUTPUT=6484_snippet_fastqtosam.bam \
READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
SAMPLE_NAME=NA12878 \ #required
LIBRARY_NAME=Solexa-272222 \ #required
PLATFORM_UNIT=H0164ALXX140820.2 \
PLATFORM=illumina \ #recommended
SEQUENCING_CENTER=BI \
RUN_DATE=2014-08-20T00:00:00-0400


Some details on select parameters:

• For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.
• For single ended reads, specify the input file with FASTQ.
• QUALITY_FORMAT is detected automatically if unspecified.
• SORT_ORDER by default is queryname.
• PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
• RUN_DATE is in Iso8601 date format.

Paired reads will have SAM flag values that reflect pairing and the fact that the reads are unmapped as shown in the example read pair below.

@H0164ALXX140820:2:1101:10003:49022/1
ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA
+
-FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ


@H0164ALXX140820:2:1101:10003:49022/2
TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG
+
AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F######################


After FastqToSam

H0164ALXX140820:2:1101:10003:49022      77      *       0       0       *       *       0       0       ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA -FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:49022      141     *       0       0       *       *       0       0       TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F###################### RG:Z:H0164.2


### (B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to cater the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting the entirety of a large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
I=6484_snippet.bam \
O=6484_snippet_revertsam.bam \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true #default


To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory


We invoke or change multiple RevertSam parameters to generate an unmapped BAM

• We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[\$1]++) { print }}'


For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.

• Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval.

• In this command, we've set MAX_DISCARD_FRACTION to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing.

Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%


Some comments on options kept at default:

• SORT_ORDER=queryname
For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed.

• RESTORE_ORIGINAL_QUALITIES=true
Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.

• REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false.

Below we show below a read pair before and after RevertSam from the tutorial data. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer--5' to 3' of the read being sequenced.

For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads.

Original BAM

H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339    CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151    PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:0  MQ:i:0  OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA    UQ:i:0  AS:i:151

H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:2  MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF###########################################################################################    UQ:i:49 AS:i:50


After RevertSam

H0164ALXX140820:2:1101:10003:23460  77  *   0   0   *   *   0   0   TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2

H0164ALXX140820:2:1101:10003:23460  141 *   0   0   *   *   0   0   TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2


Post edited by shlee on
Tagged:

• Member Posts: 82
edited April 2016

Hi,

How to perform FastqToSam function in picardtools when the fastq files are not interleaved i.e. forward (R1) and reverse (R2) fastq files separately. Picardtools is not taking two FASTQ inputs.

Error message: "Option 'FASTQ' cannot be specified more than once."

@aneek
Hi,

If you look at the FastqToSam documentation, you will find your answer.

-Sheila

• Member Posts: 82
edited April 2016

@Sheila

Thank you very much. I have to use FASTQ2 for the 2nd read.

Another thing I would like to ask is, whether there is any need to use FastQC and Trimmomatic to correct the quality of the raw data and remove the bad reads, if I use these steps of making a clean BAM from short read sequence recommended by GATK.

Also can I use threading option (-t) in all GATK commands to increase the speed of the tasks.

Thank you.

@aneek
Hi,

No, it is best to stick to the Best Practices. There should be no need to correct for anything before doing the Best Practices pre-processing steps.

You can find which tools accept -nt in the tool documentation.

-Sheila

• Member Posts: 82

@Sheila

Hi, Thank you very much for all the informations.

• Beijing Institute of Genomics, CASMember Posts: 111
edited July 2016

@Sheila

Hi,
In china, Because we don't have complete access to google, may be due to that reason i cannot download the data from google drive.. could you please make it available somewhere else? Like as you do for "bundle" as an "example" directory..

Thanks

edited July 2016

I've uploaded the tutorial datasets to the ftp site. You can find the data for this and other tutorials at: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/. Like with the bundle, remember to keep the password field blank.

• Beijing Institute of Genomics, CASMember Posts: 111

Thanks! @shlee

• Beijing Institute of Genomics, CASMember Posts: 111

@shlee

Hi,
I am curious Why RG tag is absent for the first read of above "After RevertSam" results section?
and could you please explain what does 77 and 141 values represent in column 2?

Thanks

You can check yourself why this may be with the command:

samtools view 6484_snippet_revertsam.bam | grep 'H0164ALXX140820:2:1101:10003:23460'


Then you'll see that both reads should have an RG tag. I've updated the example--thanks for pointing this out.

For an explanation of SAM flags, I refer you to the following:

• NYMember Posts: 15
edited September 2016

Hi, I tried to convert my paired-end fastqs to SAM using FastqToSam, but got the error: "In paired mode, read name 1 (HWI-D00377:30:H8EJDADXX:1:2209:8491:93586) does not match read name 2 (HWI-D00377:30:H8EJDADXX:1:2104:20024:30303)". Is there anyway to fix it? Thanks!

#### Issue · Github September 2016 by Sheila

Issue Number
1302
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

This question has been answered here.

• SwitzerlandMember Posts: 9

Hi,
I am using the script reported in paragraph "(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam"
As suggested, I set the SANITIZE option as true (indeed, the process has problems already at the revertsam-->revertmark step when setting it as false).
However, for some older bam files, I loose as much as half of the reads. Meaning that my interleaved fastq file as well as my remapped bam is much smaller in size and script number. Is there a way to circumvent such a big loss?

Thanks!

Hi @abor,

When you say you lose up to half of the reads, do you mean alignment records or unique reads? I ask because an aligned BAM can represent the same read in multiple alignment records and the reversion to uBAM has the effect of paring these down to one unique read.

Can you also describe further the problem you are running into? It would be helpful to us if you can post the tool error message.

• SwitzerlandMember Posts: 9

Hi, thanks. the err file reports:
"INFO 2016-11-25 15:41:50 RevertSam Discarded 23263512 out of 61936806 (37.560%) reads in order to sanitize output."

Consistently, the read count in my original bam file is 61,955,464, in my interleaved fastq is 38673294 and in my final recalibrated bam (after realignment/bqsr) is 38690072

edited November 2016

I think if they are qualitatively equivalent, then only the paired reads with missing mates would be worth salvaging. If this is the case, then we'd have to think about how to retain such reads to the exclusion of the other sanitized reads. So you would first RevertSam without sanitizing. Then, I believe you can filter records with mismatches in length of bases and qualities using GATK's MalformedReadFilter's optional parameter --filter_mismatching_base_and_quals. Because your BAM at this point is unaligned, and therefore not coordinate-sorted, you have to run PrintReads in --unsafe mode and discard the bai index that the tool outputs. As for duplicated records, a UNIX sort and uniq should remove these.

For example, for your unaligned uBAM:

samtools view ubam.bam | sort | uniq > ubam_uniquify.sam


Will print the unique reads to a SAM format file. You'll have to reattach your header before converted again to BAM format. Then:

java -jar GenomeAnalysisTK.jar -T PrintReads \
-I uniquify_ubam.bam \
-o ubam_filter_mismatching_base_and_quals.bam \
-R your_reference.fasta \
--filter_mismatching_base_and_quals \
--unsafe


Also, take a look at the ValidateSamFile document. I hope this is helpful.

Post edited by shlee on
• SwitzerlandMember Posts: 9
edited January 17

Hi @shlee, sorry for the long delay.
Thanks

Post edited by abor on
• SwitzerlandMember Posts: 9
edited January 17

Plus, I wonder why this happens. When using Bam files in which the RealignerTargetCreator and BaseRecalibrator steps are performed through the whole genome, I can pass the Bam files again through the pipeline (RevertSam, MarkIlluminaAdapters, SamToFasq, BWA, MarkDuplicates and GATK RealignerTargetCreator/BaseRecalibrator) several times without loosing any reads; instead, I notice that when using the -L option on the bait regions at the RealignerTargetCreator and BaseRecalibrator steps, reads are discarded if the BAMs are subsequently re-passed through RevertSam.
Of course I don't need to re-run RevertSam several times, but I could need to do it in the future. So the question is whether the -L option at RealignerTargetCreator and BaseRecalibrator steps introduces any tag in the read so that it is then discarded as "missing mate" when passed again through RevertSam. Hope to have been clear enough.

Post edited by abor on
Removing reads that cause validation errors is what SANITIZE=TRUE is for. If you don't want to remove them you need to disable that option -- but then some other tools may refuse to run on that data. Our supported pipelines all discard unpaired reads as a quality assurance measure.

When you use -L in the preprocessing steps, do you use it also with IndelRealigner and PrintReads?

Geraldine Van der Auwera, PhD

• SwitzerlandMember Posts: 9
edited January 17

Edited post: no, -L was used in RealignerTargetCreator, BaseRecalibrator and PrintReads, but not in IndelRealigner

• SwitzerlandMember Posts: 9

The question is why reads that were previously successfully mapped (they are mapped in the BAM file I use as input for RevertSam) are detected as unpaired reads and discarded

edited January 20

Hi @abor,

Since you mention that you are using older BAM files, I assume you don't have access to the original FASTQ files and the missing mates are forever gone. Is this true?

It's my understanding that aligners such as BWA align each read independently, even ends of mated reads. That GATK tools have built in quality control features that are different from that of aligners is to a user's benefit. At the least, you become aware of the quality of the data you are working with and can come to a decision on how to proceed with the discrepant data.

Do you think there may be a solution for this step to work properly also in the presence of reads with missing pairs?

my answer is yes, there is a solution. My understanding is that you wish to (i) process data through MarkIlluminaAdapters as well as (ii) keep the mate not found reads. Roughly, I think you can separate your paired reads and mate not found reads and process them as PE and SE reads separately, respectively. For example, because processing through MarkIlluminaAdapters only makes sense for PE reads (see Tutorial#6483 for why), you would process only the PE BAM through this step and skip this step for the SE BAM. Sorry, it appears MarkIlluminaAdapters has a parameter to process SE reads (MIN_MATCH_BASES_SE).

To merge the reads together for joint processing for downstream steps, well, here's where you'd have to do some exploration of tools. I think you would use either MergeSamFiles or MergeBamAlignment. Tutorial#6483 gives details on MergeBamAlignment. It will take as input multiple ALIGNED_BAM files, so you can provide it both the aligned PE BAM and aligned SE BAM.

I hope this is helpful. Let us know what ends up working for you and if you have additional questions.

Post edited by shlee on
• Member Posts: 3

Hi all,

I current have some RNA-Seq Tophat aligned files (no original fastq), both the Aligned bam and unmapped reads files and wanted to generate a fastq to align with another aligner (STAR or Salmon).

Do you think I should try to merge the Aligned and unmapped files and then perform the conversion?

I have a feeling that might not be correct because it would insert potential reads with no mate. Otherwise I'm a little stuck as to what to do with the unmapped file.

Thanks for any help!
Darryl

#### Issue · Github January 19 by Sheila

Issue Number
1648
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee
@abor When you use -L with PrintReads, you're telling GATK to throw away any reads outside of those intervals. Then any reads inside the intervals that had mates outside become mate-not-founds.

Geraldine Van der Auwera, PhD

Hi Darryl ( @dnousome ),

Whichever file contains the complete set of read pairs, you can use this particular file. STAR takes FASTQ files and I assume Salmon also takes FASTQ files. If the file you start with is the aligned BAM, you can revert using RevertSam, convert to FASTQ and then align with your aligner of choice. If it is the uBAM, then you can proceed with converting to FASTQ and then align. If you wish to pipe conversion to FASTQ and the alignment step, which allows you to avoid storing a FASTQ file, then checkout the workflow detailed in Tutorial#6483 for BWA alignment. I believe the concepts should still apply to your case. The tutorial then explains how and why you would use MergeBamAlignment.

• Member Posts: 3

@shlee

Most definitely! Thanks for your help.