The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at http://bit.ly/2i4mGxz

# [How to] Generate a BAM for variant discovery (long)

edited October 2016 in Archive

### This document is an archived rough draft of Tutorial#6483. Please use the public tutorial. If you are interested in aligning to GRCh38, then please refer to a separate tutorial, Tutorial#8017.

[work in progress--I am breaking this up into smaller chunks]

This document in part replaces the previous post (howto) Revert a BAM file to FastQ format that uses HTSlib commands. The workflow assumes familiarity with the concepts given in Collected FAQs about BAM files.

We outline steps to preprocess Illumina and similar tech DNA sequence reads for use in GATK's variant discovery workflow. This preprocessing workflow involves marking adapter sequences using MarkIlluminaAdapters so they contribute minimally to alignments, alignment using the BWA aligner's maximal exact match (MEM) algorithm, and preserving and adjusting read and read meta data using MergeBamAlignment for consistency and comparability of downstream results with analyses from the Broad Institute. With the exception of BWA, we use the most current versions of tools as of this writing. The workflow results in an aligned BAM file with appropriate meta information that is ready for processing with MarkDuplicates.

This workflow applies to three common types of sequence read files: (A) aligned BAMs that need realignment, (B) FASTQ format data and (C) raw sequencing data in BAM format. If you have raw data in BAM format (C), given appropriate read group fields, you can start with step 2. The other two formats require conversion to unmapped BAM (uBAM). We use Picard's RevertSam to convert an aligned BAM (A) or Picard's FastqToSam to convert a FASTQ (B) to the uBAM.

We address options relevant to process reads extracted from an interval as well as options to process large files, in our case a ~150G file called Solexa-272222. The tutorial uses a smaller file of reads aligning to a genomic interval, called snippet derived from Solexa-272222, for faster processing. The example commands apply to the larger file. Some comments on the workflow:

• The workflow reflects a lossless operating procedure that retains original FASTQ read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and longterm storage efficient, as one needs only store the final BAM file.
• When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
• Finally, when I call default options within a command, follow suit to ensure the same results.

#### The steps of the workflow are as follows.

1. Generate an unmapped BAM (uBAM)
(A) Convert the FASTQ to uBAM and add read group information using FastqToSam
(B1) [Optional] Extract reads in a genomic interval from aligned BAM
(B2) Convert aligned BAM to uBAM and discard problematic records using RevertSam
3. Convert uBAM to FASTQ and assign adapter bases low qualities using SamToFastq
4. Align reads and flag secondary hits using BWA MEM
5. [Optional] Pipe steps 3 & 4 and collect alignment metrics
6. [Optional] Sort, index and convert alignment to a BAM using SortSam and visualize on IGV
7. Restore altered data and apply & adjust meta information using MergeBamAlignment

### 1. Generate an unmapped BAM (uBAM)

The goal is to produce an unmapped BAM file with appropriate read group (@RG) information that differentiates not only samples, but also factors that contribute to technical artifacts. To see the read group information for a BAM file, use the following command.

samtools view -H Solexa-272222.bam | grep '@RG'


This prints the lines starting with @RG within the header. Our tutorial file's single @RG line is shown below. The file has the read group fields required by this workflow as well as extra fields for record keeping. Two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

• GATK's variant discovery workflow requires ID, SM and LB fields and recommends the PL field.
• Each @RG line has a unique ID that differentiates read groups. It is the lowest denominator that differentiates factors contributing to technical batch effects and is repeatedly indicated by the RG tag for each read record. Thus, the length of this field contributes to file size.
• SM indicates sample name and, within a collection of samples, LB indicates if the same sample was sequenced in multiple lanes. See item 8 of Collected FAQs about BAM files for more detail.
• PU is not required by any GATK tool. If present it is used by BQSR instead of ID. It is required by Picard's AddOrReplaceReadGroups but not FastqToSam.

Here we illustrate how to derive both ID and PU fields from query names. We break down the common portion of two different read query names from the tutorial file.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

#Breaking down the common portion of the query names:
H0164____________ # portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ # portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 # portion of @RG ID and PU fields indicating flow cell lane


#### (A) Convert the FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to unmapped BAM, requires two read group fields and makes optional specification of other read group fields. In the command below we note which fields are required for our workflow. All other read group fields are optional.

java -Xmx8G -jar /seq/software/picard/current/bin/picard.jar FastqToSam \
FASTQ=snippet_XT_interleaved.fq \ #our single tutorial file contains both reads in a pair
OUTPUT=snippet_FastqToSam_PU.bam \
READ_GROUP_NAME=H0164.2 \ # required; changed from default of A
SAMPLE_NAME=NA12878 \ # required
LIBRARY_NAME=Solexa-272222 \ # required
PLATFORM_UNIT=H0164ALXX140820.2 \
PLATFORM=illumina \ # recommended
SEQUENCING_CENTER=BI \
RUN_DATE=2014-08-20T00:00:00-0400


Some details on select parameters:

• QUALITY_FORMAT is detected automatically if unspecified.
• SORT_ORDER by default is queryname.
• Specify both FASTQ and FASTQ2 for paired reads in separate files.
• PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
• RUN_DATE is in Iso8601 date format.

#### (B1) [Optional] Extract reads in a genomic interval from aligned BAM

We want to test our reversion process on a subset of the tutorial file before committing to reverting the entire BAM. This process requires the reads in the BAM to be aligned to a reference genome and produces a BAM containing reads from a genomic interval.

java -Xmx8G -jar /path/GenomeAnalysisTK.jar \
-R /path/human_g1k_v37_decoy.fasta \
-L 10:90000000-100000000 \ # this is the retained interval
-I Solexa-272222.bam -o snippet.bam # snippet.bam is newly created

• This seems a good time to bring this up. In the command, the -Xmx8G Java option sets the maximum heap size, or memory usage to eight gigabytes. We want to both cap Java's use of memory so the system doesn't slow down as well as allow enough memory for the tool to run without causing an out of memory error. The -Xmx settings we provide here is more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. I have heard up to16G specified and have also omitted this option for small files. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.
• This step is for our tutorial only. For applying interval lists, e.g. to whole exome data, see When should I use L to pass in a list of intervals.

#### (B2) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information. The resulting unmapped BAM (uBAM) has two uses in this workflow: (1) for processing through the MarkIlluminaAdapters branch of the workflow, and (2) for application of read group, read sequence and other read meta information to the aligned read file in the MergeBamAlignment branch of the workflow. The RevertSam parameters we specify remove information pertaining to previous alignments including program group records and standard alignment flags and tags that would otherwise transfer over in the MergeBamAlignment step. We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option so that it is free for use by MarkIlluminaAdapters. Our settings also reset flags to unmapped values, e.g. 77 and 141 for paired reads. Additionally, we invoke the SANITIZE option to remove reads that cause problems for MarkIlluminaAdapters. Our tutorial's snippet requires such filtering while Solexa-272222 does not.

For our particular file, we use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
I=snippet.bam \
O=snippet_revert.bam \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \ # informational; does not affect processing
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true #default


To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee # sets environmental variable for temporary directory


We change these settings for RevertSam:

• SANITIZE If the BAM file contains problematic reads, such as that might arise from taking a genomic interval of reads (Step 1), then RevertSam's SANTITIZE option removes them. Our workflow's downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities.
• MAX_DISCARD_FRACTION is set to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing.

Exception in thread "main" picard.PicardException: Discarded 0.947% which is above MAX_DISCARD_FRACTION of 0.500%

• ATTRIBUTE_TO_CLEAR is set to clear more than the default standard tags, which are NM, UQ, PG, MD, MQ, SA, MC, and AS tags. The AS tag is removed by default for Picard releases starting 9/2015. Remove all other tags, such as the XT tag needed by MarkIlluminaAdapters, by specifying each with the ATTRIBUTE_TO_CLEAR option. To list all tags within my BAM, I used the command below to get RG, OC, XN, OP, SA, MD, NM, PG, UQ, MC, MQ, AS, XT, and OQ tags. Those removed by default and by RESTORE_ORIGINAL_QUALITIES are italicized. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.

samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[\$1]++) { print }}'


Some comments on options kept at default:

• SORT_ORDER=queryname
For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed.

• RESTORE_ORIGINAL_QUALITIES=true
Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.

• REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment information. It also invokes the default ATTRIBUTE_TO_CLEAR parameter which removes standard alignment tags.

For snippet.bam, SANITIZE removes 25,909 out of 2,735,539 (0.947%) reads, leaving us with 2,709,630 reads. The intact BAM retains all reads. The example shows a read pair before and after RevertSam.

#original BAM
H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339    CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151    PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:0  MQ:i:0  OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA    UQ:i:0  AS:i:151
H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:2  MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF###########################################################################################    UQ:i:49 AS:i:50

#after RevertSam (step 1.B2)
H0164ALXX140820:2:1101:10003:23460  77  *   0   0   *   *   0   0   TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:23460  141 *   0   0   *   *   0   0   TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2


java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=snippet_revert.bam \
O=snippet_revertmark.bam \
M=snippet_revertmark.metrics.txt \ #naming required
TMP_DIR=/path/shlee # optional to process large files

• By default, the tool uses Illumina adapter sequences. This is sufficient for our tutorial data. Specify other adapter sequences as outlined in the tool documentation.
• Only reads with adapter sequence are marked with the tag in XT:i:[#] format, where # denotes the starting position of the adapter sequence.

The example shows a read pair marked with the XT tag by MarkIlluminaAdapters. This is a different pair than shown previously as H0164ALXX140820:2:1101:10003:23460 reads do not contain adapter sequence. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

#after MarkIlluminaAdapters (step 2)
H0164ALXX140820:2:1101:15118:25288  77  *   0   0   *   *   0   0
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--A-FAJA-F<J7A--AFJ7AJ7AJ-FJ7-JJJ-F-J---7J---7FF-JAJJ<A7JFAFAA7--FF----AF-7<JF<JFA-7<F-FF-J RG:Z:H0164.2    XT:i:63
H0164ALXX140820:2:1101:15118:25288  141 *   0   0   *   *   0   0
GTCATGGCTGGACGCAGTGGCTCATACCTGTAATCCCAGCACTTTTGGAGGCTGAGGCAGGTAGATCGGAAGCGCCTCGTGTAGGGAGAGAGGGTTAACAAAAATGTAGATACCGGAGGTCGCCGTAAAATAAAAAAGTAGCAAGGAGTAG
AAFFFJJJJJAJJJJJFJJJJ<JFJJJJJJJJFJJJJFJ<FJJJJAJJJJJJJJFJJJ7JJ--JJJ<J<-FJ7F--<-J7--7AJJA-J------J7F<-77--F--FFJ---J-J-J--A-7<<----J-7-J-FJ--J--FA####### RG:Z:H0164.2    XT:i:63

#after SamToFastq (step 3)
@H0164ALXX140820:2:1101:15118:25288/1
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
+
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--#########################################################################################
@H0164ALXX140820:2:1101:15118:25288/2
GTCATGGCTGGACGCAGTGGCTCATACCTGTAATCCCAGCACTTTTGGAGGCTGAGGCAGGTAGATCGGAAGCGCCTCGTGTAGGGAGAGAGGGTTAACAAAAATGTAGATACCGGAGGTCGCCGTAAAATAAAAAAGTAGCAAGGAGTAG
+
AAFFFJJJJJAJJJJJFJJJJ<JFJJJJJJJJFJJJJFJ<FJJJJAJJJJJJJJFJJJ7JJ-#########################################################################################

#after MergeBamAlignment (step 7)
H0164ALXX140820:2:1101:15118:25288  99  10  99151971    60  151M    =   99152350    440
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--A-FAJA-F<J7A--AFJ7AJ7AJ-FJ7-JJJ-F-J---7J---7FF-JAJJ<A7JFAFAA7--FF----AF-7<JF<JFA-7<F-FF-J MC:Z:90S61M MD:Z:74T10T3A37T23  PG:Z:bwamem RG:Z:H0164.2    NM:i:4  MQ:i:60 UQ:i:48 AS:i:131    XS:i:40
H0164ALXX140820:2:1101:15118:25288  147 10  99152350    60  90S61M  =   99151971    -440
CTACTCCTTGCTACTTTTTTATTTTACGGCGACCTCCGGTATCTACATTTTTGTTAACCCTCTCTCCCTACACGAGGCGCTTCCGATCTACCTGCCTCAGCCTCCAAAAGTGCTGGGATTACAGGTATGAGCCACTGCGTCCAGCCATGAC
#######AF--J--JF-J-7-J----<<7-A--J-J-J---JFF--F--77-<F7J------J-AJJA7--7J-<--F7JF-<J<JJJ--JJ7JJJFJJJJJJJJAJJJJF<JFJJJJFJJJJJJJJFJ<JJJJFJJJJJAJJJJJFFFAA MC:Z:151M   MD:Z:61 PG:Z:bwamem RG:Z:H0164.2    NM:i:0  MQ:i:60 UQ:i:0  AS:i:61 XS:i:50


Snippet_revertmark.bam marks 5,810 reads (0.21%) with XT, while Solexa-272222_revertmark.bam marks 3,236,552 reads (0.39%). We plot the metrics data using RStudio.

### 3. Convert BAM to FASTQ using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove adapter sequences previously marked with the XT tag. All extant meta data, i.e. alignment information, flags and tags, are purged in this transformation.

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=snippet_revertmark.bam \
FASTQ=snippet_XT_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true \
TMP_DIR=/path/shlee # optional to process large files

• By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to read alignment and alignment scoring metrics. This reassignment is temporary as we will restore the original base quality scores after alignment in step 7.

• For our paired reads sample we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query name of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file.

BWA aligner accepts interleaved FASTQ files given the -p option. This command indicates that the i-th and the (i+1)-th reads constitute a read pair.

• We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider archaic 0x200 flag bit that denotes reads that do not pass quality controls. These reads are also known as failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.

#### [Optional] Compress the FASTQ using gzip

This step is optional. The step is irrelevant if you pipe steps 3 and 4, as we outline in step 5.

BWA handles both FASTQ and gzipped FASTQ files natively--that is, BWA works on both file types directly. Thus, this step is optional. Compress the FASTQ file using the UNIX gzip utility.

gzip snippet_XT_interleaved.fq #replaces the file with snippet_XT_interleaved.fq.gz


### 4. Align reads and flag secondary hits using BWA MEM

GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

• We use BWA v 0.7.7.r441, the same aligner used by the Broad's Genomics Platform as of this writing (9/2015).
• Alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Ebstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
• Aligning our snippet reads from a genomic interval against either a portion or the whole genome is not equivalent to aligning our entire file and taking a new slice from the same genomic interval.

Index the reference genome file for BWA. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

bwa index -a bwtsw human_g1k_v37_decoy.fasta


Align using BWA MEM. The tool automatically locates the index files within the same folder as the reference FASTA file. In the alignment command, > denotes the aligned file.

• The aligned file is in SAM format even if given a BAM extension and retains the sort order of the FASTQ file. Thus, our aligned tutorial file remains sorted by query name.
• BWA automatically creates a program group record (@PG) in the header that gives the ID, group name, group version, and command line information.

/path/bwa mem -M -t 7 -p \
/path/Homo_sapiens_assembly19.fasta \ #reference genome
Solexa-272222_interleavedXT.fq > Solexa-272222_markXT_aln.sam

We invoke three options in the command.

• -M to flag shorter split hits as secondary.
This is optional for Picard compatibility. However, if we want MergeBamAlignment to reassign proper pair alignments, we need to mark secondary alignments.

• -p to indicate the given file contains interleaved paired reads.

• -t followed by a number for the additional number of processor threads to use concurrently. Check your server or system's total number of threads with the following command.

getconf _NPROCESSORS_ONLN #thanks Kate


MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, the point of this workflow is to take advantage of the features offered by MergeBamAlignment that allow for the scalable, lossless operating procedure practiced by Broad's Genomics Platform and to produce comparable metrics.

### 5. [Optional] Pipe steps 3 & 4 and collect alignment metrics

Piping processes saves time and space. Our tutorial's resulting SAM file is small enough to easily view, manipulate and store. For larger data, however, consider using Unix pipelines. Piping allows streaming data in the processor's input-output (I/O) device directly to the next process for efficient processing and storage. We recommend piping steps 3 and 4 so as to avoid rereading and storing the large intermediate FASTQ file.

You may additionally extend piping to include step 6's SortSam. Steps 3-4-6 are piped in the example command below to generate an aligned BAM file and index. [For the larger file, I couldn't pipe Step 7's MergeBamAlignment.]

#overview of command structure
[step 3's SamToFastq] | [step 4's bwa mem] | [step 6's SortSam]

#for our file
java -Xmx8G -jar /path/picard.jar SamToFastq I=snippet_revertmark.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta \
/dev/stdin | \  #to stop piping here, add '> snippet_piped.sam'
java -Xmx8G -jar /path/picard.jar SortSam \
INPUT=/dev/stdin \
OUTPUT=snippet_piped.bam \
SORT_ORDER=coordinate CREATE_INDEX=true \
TMP_DIR=/path/shlee


Calculate alignment metrics using Picard tools. Picard offers a variety of metrics collecting tools, e.g. CollectAlignmentSummaryMetrics, CollectWgsMetrics and CollectInsertSizeMetrics. Some tools give more detailed metrics if given the reference sequence. See Picard for metrics definitions. Metrics calculations will differ if run on the BAM directly from alignment (BWA) versus on the merged BAM (MergeBamAlignment). See [link--get from G] for guidelines on when to run tools.

java -Xmx8G -jar /path/picard.jar CollectAlignmentSummaryMetrics \
R=/path/Homo_sapiens_assembly19.fasta \
INPUT=slice.bam \
OUTPUT=slice_bam_metrics.txt \
TMP_DIR=/path/shlee # optional to process large files


For example, percent chimeras is a calculated metric. Our tutorial alignment of the whole data set gives 0.019% (BWA) or 0.0034% (MergeBamAlignment) chimeric paired reads. The genomic interval defined in step 1 reports 0.0032% chimeric paired reads. In contrast, the aligned snippet gives 0.0012% (BWA) or 0.00002% (MergeBamAlignment) chimeric paired reads. This illustrates in part the differences I alluded to at the beginning of step 4.

### 6. [Optional] Sort, index and convert alignment to a BAM using SortSam and visualize on IGV

Picard's SortSam sorts, indexes and converts between SAM and BAM formats. For file manipulations and to view aligned reads using the Integrative Genomics Viewer (IGV), the SAM or BAM file must be coordinate-sorted and indexed. Some Picard tools, such as MergeBamAlignment in step 7, by default coordinate sort and can use the standard CREATE_INDEX option. If you didn't create an index in step 7, or want to convert to BAM and index the alignment file from step 4, then use Picard's SortSam. The index file will have an sai or bai extension depending on the specified format.

java -Xmx8G -jar /path/picard.jar SortSam \
INPUT=Solexa-272222_markXT_aln.sam \
OUTPUT=Solexa-272222_markXT_aln.bam \ #extension here specifies format conversion
SORT_ORDER=coordinate \
CREATE_INDEX=true \ # a standard option for Picard commands
TMP_DIR=/path/shlee # optional to process large files


View aligned reads using the Integrative Genomics Viewer (IGV). Of the multiple IGV versions, the Java Web Start jnlp version allows the highest memory, as of this writing 10 GB for machines with 64-bit Java.

• To run the jnlp version of IGV, you may need to adjust your system's Java Control Panel settings, e.g. enable Java content in the browser. Also, when first opening the jnlp, overcome Mac OS X's gatekeeper function by right-clicking the saved jnlp and selecting Open with Java Web Start.
• Load the appropriate reference genome. For our tutorial this is Human (b37).
• Go to View>Preferences and make sure the settings under the Alignments tab allows you to view reads of interest, e.g. duplicate reads. Default settings are tuned to genomic sequence libraries. Right-click on a track to access a menu of additional viewing modes. See Viewing Alignments in IGV documentation for details.
• Go to File>Load from and either load alignments from File or URL.

Here, IGV displays our example chimeric pair, H0164ALXX140820:2:1101:10003:23460 at its alignment loci. BWA's secondary alignment designation causes the mates on chromosome 10 to display as unpaired in IGV's paired view. MergeBamAlignment corrects for this when it switches the secondary alignment designation. Mates display as paired on chromosome 10.

Visualizing alignments in such a manner makes apparent certain convergent information. For example, we see that the chimeric region on chromosome 2 is a low complexity GC-rich region, apparent by the predominantly yellow coloring (representing guanine) of the reference region. We know there are many multimapping reads because reads with MAPQ score of zero are filled in white versus gray, and the region is down-sampled, as indicated by the underscoring in the log-scaled coverage chart. We can infer reads in this chromosome 2 region are poorly mapped based on the region's low complexity, depth of reads and prevalence of low MAPQ reads.

### 7. Restore altered data and apply & adjust meta information using MergeBamAlignment

Our alignment file lacks read group information and certain tags, such as the mate CIGAR (MC) tag. It has hard-clipped sequences and altered base qualities. The alignment also has some mapping artifacts we would like to correct for accounting congruency. Finally, the alignment records require coordinate sorting and indexing.

• Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations.
• MergeBamAlignment retains secondary alignments with the INCLUDE_SECONDARY_ALIGNMENTS parameter. It may be that alignments marked as secondary are truer to biology or at least reveal useful insight.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and other flags, e.g. read mapped in proper pair and mate unmapped flags, to fix mapping artifacts. We only note some changes made by MergeBamAlignment to our tutorial data and by no means comprehensively list its features.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \
UNMAPPED_BAM=Solexa-272222_revertclean.bam \
ALIGNED_BAM=Solexa-272222_markXT_aln.sam \
O=Solexa-272222_merge_IGV_raw.bam \ #output file name in SAM or BAM format
CREATE_INDEX=true \ #standard option for any Picard command
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not overlap
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain alignment tags starting with X, Y, or Z
TMP_DIR=/path/shlee #optional to process large files


You need not invoke PROGRAM options as BWA's program group information is sufficient and transfer from the alignment during the merging. If, for whatever reason, you need to apply program group information by a different means, then use MergeBamAlignment to assign each of the following program group options. Example information is given.

    PROGRAM_RECORD_ID=bwa \
PROGRAM_GROUP_NAME=bwamem \
PROGRAM_GROUP_VERSION=0.7.7-r441 \
PROGRAM_GROUP_COMMAND_LINE='/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta Solexa-272222_interleavedXT.fq > Solexa-272222_markXT_aln.sam' \


In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters.

• The path to the reference FASTA given by R should also contain the corresponding sequence dictionary with the same base name and extension .dict. Create a sequence dictionary using Picard's CreateSequenceDictionary.
• CLIP_ADAPTERS=false leaves reads unclipped.
• MAX_INSERTIONS_OR_DELETIONS=-1 retains reads irregardless of the number of insertions and deletions. Default is 1.
• PRIMARY_ALIGNMENT_STRATEGY=MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
• ATTRIBUTES_TO_RETAIN is specified to carryover the XS tag from the alignment, which for BWA reports suboptimal alignment scores. The XS tag in not necessary for our workflow. We retain it to illustrate that the tool only carries over select alignment information unless specified otherwise. For our tutorial data, this is the only additional unaccounted tag from the alignment. [IDK if this tag is used downstream; need to confirm I can keep this.]
• Because we have left the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment may reassign proper pair designations made by the aligner.
• By default the merged file is coordinate sorted. We set CREATE_INDEX=true to additionally create the bai index.

Original base quality score restoration is illustrated in Step 3. The following example shows a read pair for which MergeBamAlignment adjusts multiple other information fields. The query name is listed thrice because we have paired reads where one of the reads has two alignment loci, on chromosome 2 and on chromosome 10. The mate is mapped with high MAPQ to chromosome 10. The two loci align 69 and 60 nucleotide regions, respectively, and the aligned regions coincide by 15 bases. A good portion of the chromosome 2 aligned region has low base quality scores. The NM tag indicates that the chromosome 2 alignment requires one change to match the reference, while the chromosome 10 read requires two changes and this is also reflected in the MD tags that provide the mismatching positions. When tallying alignment scores, given by the AS tag, aligners penalize mismatching positions, here apparently by five points per mismatch, e.g. 60 matches minus two mismatches multiplied by five gives an alignment score of 50. Both read records have values for the XS (suboptimal alignment score) and SA (chimeric alignment) tags that indicate a split read. Flag values, set by BWA, indicate the chromosome 2 record is primary and the chromosome 10 record is secondary.

#aligned reads from step 4
H0164ALXX140820:2:1101:10003:23460  177 2   33141435    0   37S69M45S   10  91515318    0
GGGTGGGAGGGGGGGAGAGAGGGGTGGGAGAGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAAGGAAAGGAGGGAGGGAGGGAGCAAGGAAGGAAGGAAGGAAAGA ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< NM:i:1  MD:Z:51G17  AS:i:64 XS:i:64 SA:Z:10,91515130,+,60M91S,0,2;

H0164ALXX140820:2:1101:10003:23460  417 10  91515130    0   60M91H  =   91515318    339
TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCC    <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF    NM:i:2  MD:Z:48T4T6 AS:i:50 XS:i:36 SA:Z:2,33141435,-,37S69M45S,0,1;

H0164ALXX140820:2:1101:10003:23460  113 10  91515318    60  151M    2   33141435    0
CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA <FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA NM:i:0  MD:Z:151    AS:i:151    XS:i:0

#after merging (step 7)
H0164ALXX140820:2:1101:10003:23460  409 2   33141435    0   37S69M45S   =   33141435    0
GGGTGGGAGGGGGGGAGAGAGGGGTGGGAGAGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAAGGAAAGGAGGGAGGGAGGGAGCAAGGAAGGAAGGAAGGAAAGA ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< SA:Z:10,91515130,+,60M91S,0,2;  MD:Z:51G17  PG:Z:bwamem RG:Z:H0164.2    NM:i:1  UQ:i:2  AS:i:64 XS:i:64

H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339
TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< SA:Z:2,33141435,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:bwamem RG:Z:H0164.2    NM:i:2  MQ:i:60 UQ:i:4  AS:i:50 XS:i:36

H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339
CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA <FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA MC:Z:60M91S MD:Z:151    PG:Z:bwamem RG:Z:H0164.2    NM:i:0  MQ:i:0  UQ:i:0  AS:i:151    XS:i:0

• For the read with two alignments, the aligner hard-clipped the alignment on chromosome 10 giving a CIGAR string of 60M91H and a truncated read sequence. MergeBamAlignment restores this chromosome 10 alignment with a full read sequence and adjusts the CIGAR string to 60M91S, which soft-clips the previously hard-clipped region without loss of alignment specificity.
• Both chromosome 2 and chromosome 10 alignments have zero mapping qualities to indicate multiple equally likely mappings. The similar alignment scores of 64 and 50, given by the AS tag, contribute in part to this ambiguity. Additionally, because we asked the aligner to flag shorter split reads as secondary, with the -M option, it assigned a 417 flag to the shorter split chromosome 10 alignment. This makes the chromosome 2 alignment for this read the primary alignment. We set our PRIMARY_ALIGNMENT_STRATEGY to MostDistant which asks the tool to consider the best pair to mark as primary from the primary and secondary records. MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment (163 flag) and the chromosome 2 mapping as secondary (409 flag).
• MergeBamAlignment updates read group RG information, program group PG information and mate CIGAR MC tags as specified by our command for reads and in the header section. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment) and MQ (mapping quality of the mate/next segment) tags if applicable. The following table summarizes changes to our tutorial data's tags during the workflow.
original RevertSam BWA MEM MergeBamAlignment
PG PG PG program group
OC original cigar
XN # of ambiguous bases in ref
OP original mapping position
SA SA SA chimeric alignment
MD MD MD string for mismatching positions
NM NM NM # of mismatches
AS AS AS alignment score
UQ UQ Phred likelihood of the segment
MC MC CIGAR string for mate
MQ MQ mapping quality of the mate
OQ original base quality
XT tool specific
XS XS BWA's secondary alignment score
• In our example, we retained the aligner generated XS tag, for secondary alignment scores, with the ATTRIBUTES_TO_RETAIN option.

After merging our whole tutorial file, our unmapped read records increases by 620, from 5,334,323 to 5,334,943 due to changes in flag designations and not because any reads failed to map. Our total read records remains the same at 828,846,200 for our 819,728,254 original reads, giving ~1.11% multi-record reads.

Post edited by shlee on

• Dev Posts: 50

• the interleave file is not something that most people expect to see, so there should be some explanation about it and option also not to use it in case people want to use some other aligner.
• the more efficient way is to stream the output of one step to the next without outputting the intermediate files, especially if it is an interleave file.
edited September 2015

Thanks @ami . I have explained interleave and its implications. How do you stream outputs?

Post edited by shlee on
• Dev Posts: 50

It has been a while since I did it, but as far as I remember you can put the different steps in one command line with | bewteen them. For example: java -Xmx8G -jar /seq/software/picard/current/bin/picard.jar SamToFastq .... | bwa-mem... | gzip
HL will be able to remember it better I think.
If you can't figure it out with him, I can probably find some old code from CMI/genome Bridge days, where we used it.

edited August 2015

@ami I see. Geraldine has a version of this I could adopt for the new workflow in an older document in step 4. I'll test this out.

@shlee can you define ID, SM, LB, and PL in the text:
GATK's variant discovery workflow requires ID, SM and LB fields and recommends the PL field.
Thanks,
-D

edited October 2015

Will do @dekling in the linked doc. I should mention the doc as it stands is at some Vanilla limit for length and I cannot add more content.

• South KoreaMember Posts: 3

As this pipeline, finally the quality scores of adapter sequences are restored, and are not clipped out. So I wonder if variant callers such as HaplotypeCaller or UnifiedGenotyper call the adapter sequences as variants incorrectly. Or the MD tag or any other tags in BAM file can prevent that false consequence.

• South KoreaMember Posts: 3
edited September 2016

Shoud I use dontUseSoftClippedBases option at HaplotypeCaller to get more accurate results?

Hi @Kyuwon,