The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at http://bit.ly/2i4mGxz

# (How to) Map and clean up short read sequence data efficiently

edited January 19

If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow both maximize data quality, storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570 coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate sorted, indexed.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane and thereby assigns all the reads the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation of using the most up-to-date tools, for the given example results, with the exception of BWA, we used the most current versions of tools as of their testing (September to December 2015). We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

#### Prerequisites

• Installed Picard tools
• Installed GATK tools
• Installed BWA
• Reference genome
• Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

• To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV.
• I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.

#### Other notes

• When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
• For large files, (1) use the Java -Xmx setting and (2) set the environmental variable TMP_DIR for a temporary directory.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
TMP_DIR=/path/shlee


In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down as well as run without causing an out of memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data and the equivalent command is as follows.

java -jar /path/picard.jar MarkIlluminaAdapters


To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.

• When I call default options within a command, follow suit to ensure the same results.

## 1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider and this post and this post provide some strategic advice on handling multiplexed data.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

• To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
• To convert an Illumina Base Call files (BCL) use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6383_snippet_revertsam.bam, containing 275,546 reads.

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-though can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
TMP_DIR=/path/shlee #optional to process large files


This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

• By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.
• Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences first set ADAPTERS to 'null' then specify each sequence with the parameter.

We plot the metrics data that is in GATKReport file format using RStudio, and as you can see, marked bases vary in size up to the full length of reads.

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

Unmapped uBAM (step 1)

After SamToFastq (step 3)

After MergeBamAlignment (step 3)

## 3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need for the processor to read in and write out data for two of the processes, as piping retains data in the processor's input-output (I/O) device for the next process.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.

### 3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx8G -jar /path/picard.jar SamToFastq \
FASTQ=6483_snippet_samtofastq_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true \
TMP_DIR=/path/shlee #optional to process large files


This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

• For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query name of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file. BWA aligner accepts interleaved FASTQ files given the -p option.
• We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
• Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.

### 3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

• Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.
• For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).
• As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Ebstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
• BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

bwa index -a bwtsw human_g1k_v37_decoy.fasta


The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step unused in workflow. See piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \
6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam


This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_unthreaded_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and does not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

Does the aligned file contain read group information?

We invoke three options in the command.

• -M to flag shorter split hits as secondary.
This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

• -p to indicate the given file contains interleaved paired reads.

• -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads which is one less than the total threads available on my Mac laptap. Check your server or system's total number of threads with the following command provided by KateN.

getconf _NPROCESSORS_ONLN


In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their _mapped_ mates but with MAPQ of zero. Additionally, the asterisked reads had varying noticeable amounts of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.

After BWA-MEM (step 3)

After MergeBamAlignment (step 3)

### 3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. Additional for desired congruency. I will soon explain these and additional changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and it proper mate (0x20).

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that both the uBAM and aligned BAM are both sorted by queryname.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \
UNMAPPED_BAM=6383_snippet_revertsam.bam \
ALIGNED_BAM=6483_snippet_bwa_mem.sam \ #accepts either SAM or BAM
O=6483_snippet_mergebamalignment.bam \
CREATE_INDEX=true \ #standard Picard option for coordinate-sorted outputs
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not extend past each other
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain tags starting with X, Y, or Z
TMP_DIR=/path/shlee #optional to process large files


This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

• Setting PRIMARY_ALIGNMENT_STRATEGYto MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
• It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.
• Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads irregardless of the number of insertions and deletions. The default is 1.
• Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
• ATTRIBUTES_TO_RETAIN is specified to carryover the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag in unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
• Setting CLIP_ADAPTERS to false leaves reads unclipped.
• By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
• We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
• As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

• MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.
• Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.
• Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

• MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.
• The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.
• MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
• In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
• MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the number_unmapped reads_ remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?

After BWA-MEM alignment

After MergeBamAlignment

### 3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

We pipe the three tools described above to generate an aligned BAM file sorted by query name. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file.

Before using a piped command, we should ask UNIX to stop the piped command if any step of the pipe should error and also return to us the error messages. Type the following into your shell to set these UNIX options.

set -o pipefail


Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]


Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=6383_snippet_revertsam.bam \
OUTPUT=6483_snippet_piped.bam \
INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee


The piped output file, 6483_snippet_piped.bam, is for all intensive purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

How do these small differences arise?

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the process updates, that also remain constant for these instances, are the number of alignment records and the number of unmapped reads.

INFO    2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.


### Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAMXT. We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.

For workflows that nestle this pipeline, consider additionally optimizing java jar's parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...


I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.

Post edited by shlee on
Tagged:

• GreensboroMember Posts: 74

Quite a good and comprehensive workflow ! Just out of curiosity - if we follow this Broad production pipeline should we still cleanSam and fixMateinformation after aligment (BWA) or at any other stages.

You shouldn't need to as MergeBamAlignments will do all that cleanup for you.

Geraldine Van der Auwera, PhD

• GreensboroMember Posts: 74
• Member Posts: 1

Hi, do you have an idea when tutorial 6570 will be online? Thanks

@jemorg
Hi,

Can you tell us the exact name of the tutorial? Perhaps that is an outdated/archived tutorial that has been replaced.

Thanks,
Sheila

Hi @jemorg,

For now, please refer to Picard's documentation of IlluminaBasecallsToSam. I can't say when Tutorial#6570 will be written.

• KielMember Posts: 75

@shlee Ty this is quite a nice tutorial!

At the moment I playing around with paired-end amplicon data. I want to clean up my aligned bam-file with MergeBamAligment. I see in the manual broadinstitute.github.io/picard/command-line-overview.html#MergeBamAlignment that MergeBamAligment has an option to enter the first and the second read of pair as aligned bamfiles. But no option to to deliver the two bam-files with the unmapped reads.

Now my Question
Should I merge the two unmapped bam-file into one ? And use the merged bamfile as unmapped bam file ?
Or should i run MergeBamAligment twice, with the unmapped.bam for first and the second reads of pair ?

Ty!

edited March 2016

Merge the two unmapped BAMs (uBAMs) into one before running MergeBamAlignment. MergeBamAlignment looks for read pairs in the uBAM and will error if it cannot find the second read from a pair. To merge the two uBAMs, see MergeSamFiles. Be sure that MergeSamFiles does not modify your read group IDs and is set to queryname sort.

Let me know if you encounter problems.

Soo Hee

• KielMember Posts: 75

@shlee Ty for your help! I now use the FastqToSam command to create the uBAM-File, with the F1 and F2 option. It seems to running fine...just waiting for my pipeline to finish the job.

• rutgersMember Posts: 5

Hi @shlee
I'm following this tutorial but I'm stuck on the MergeBamAlignment. As you've said before, it looks for read pairs in the uBAM and will error if it doesn't find the second read from a pair. When I try to use the MergeBamAlignment command, it doesn't output any file. I've looked at different potential problem and more particularly at the previous comment.. but no luck. However when I tried FastqToSam to combine my reads I got an error message telling me that my paired reads do not have the same names. The paired reads are named "read1_1" and "read1_2" and do not have the /1 and /2 that I suppose should be present. Do you think that could be my problem? If so is there any command to change the way my reads are named?
thanks a lot!
Guillaume

#### Issue · Github April 2016 by Sheila

Issue Number
797
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Hi @daverdin,

MergeBamAlignment identifies read pair mates by their identical read names. So I think you are correct to try to resolve what I assume are your different pair read names using FastqToSam. Within the FastqToSam code we find the passage:

Paired reads must either have the exact same read names or they must contain at least one "/" and the First pair read name must end with "/1" and second pair read name ends with "/2". The baseName (read name part before the /) must be the same for both read names.

So I believe your read names must use /1 and /2 instead of your _1 and _2 for FastqToSam to recognize them. Also, be sure your read names do not contain blanks. Finally, the FastqToSam documentation shows that you need to set the STRIP_UNPAIRED_MATE_NUMBER argument to true to make sure the /1 and /2 do not carry over to the output file read names.

In terms of an easy way to change your reads, I found this forum post from 2012 that gives a variety of potential solutions. I hope you find a solution that works.

For what it's worth I think Picard should accept either convention. @daverdin, I would encourage you to put in a feature request in the Picard repository on github to make this a configurable option.

Geraldine Van der Auwera, PhD

• rutgersMember Posts: 5

Hello @shlee, @Geraldine_VdAuwera
Thanks for your responses, I've layed with my data and found that the _1 and _2 were indeed the problem in my files..
I replaced the end of each read names (sed 's/_2\$/\/2/' Sample.2.fq > Sample_test2.fq) and was able to obtain the desired output. I just need to verify that my command didn't remove anything and I should be good to continue.. I will also write a request on github.
thanks a lot for your help..

• PolandMember Posts: 15
edited April 2016

Hi, I was following the GATK Best Practice (howto) Map and mark duplicates But I did not catch this MergeBamAlignment part so here is what I did;

Begin from fastq file "was processed using Prinseq to filter and trim sequence" --> then was mapped to the reference using best practice work follow as suggested BWA mem --> then sorted , and used picard to Mark Duplicates.

But now I found that I did not do MergeBamAlignment as suggested by this tutorial, shall I rerun every thing or there is better option?!

Hi @medhat,

In the introduction to this tutorial we mention that there are various different ways of correctly pre-processing HTS data for variant discovery, of which this tutorial is only one. So depending on your study design and aims, and any alternative pre-processing that you've already done for your samples, e.g. adding in read groups and MC tags and converting hardclips to softclips, you may find this particular pre-processing workflow redundant. We provide this tutorial workflow as a guide to represent what the Broad Genomics Platform does in their pre-processing and to highlight particular considerations that were previously undocumented, i.e. MergeBamAlignment's features.

If you can tell us more about your "filter and trim" step, which I do not believe is part of our best practice workflows, and your study design, then perhaps we can better answer your question. For example, GATK workflows take base quality into account so filtering reads and trimming ends isn't part of our workflow as explained in this discussion. If you're new to our workflows, I would suggest sticking with what is outlined in our Best Practice Guide.

Soo Hee

• DohaMember Posts: 3

Hi,

I was wondering if I could get some clarification. I have been following this tutorial and using the data snippets provided. I converted the .fq to uBAM using tutorial#6484 however I get an error when I MergeBamAlignment which having done some reading I think points to the bam not being sorted correctly?

I think my script is good as if I changed the unmapped bam I generate to the reverted BAM provided in the bundle it works.

Exception in thread "main" java.lang.IllegalStateException: Aligned record iterator (HWI-ST970:586:C2PY6ACXX:1:1115:10000:27368) is behind t

I just wondered if this was known and should I be sorting the output from BWA before merging the BAM files? I was using the piped command.

Many thanks

• PolandMember Posts: 15

Thanks A lot this was very helpful

Hi @silks225. I'm looking into this and will get back to you hopefully within the week.

• DohaMember Posts: 3

edited May 2016

Hi @silks225,

First, I've updated Option A of Tutorial#6484. You'll notice now the command takes two FASTQ files (FASTQ and FASTQ2) and the tutorial_6484_FastqToSam.tar.gz download contains two FASTQ files that match the command. Please use these files and the updated tutorial instructions. It turns out FastqToSam is unable to process an interleaved paired FASTQ file. I apologize for this and thank you for bringing this to my attention.

Second, be sure to use the commands as shown as they preserve the default queryname sorting of the output files for Tutorial#6484 and Tutorial#6483. All of the commands take queryname sorted records. Only the MergeBamAlignment step produces coordinate sorted records.

Let me know if you encounter additional errors.

• DohaMember Posts: 3

Great, thanks @shlee for the clarification that makes a lot of sense I will give it ago.

Hi I was wondering why the final "clean bam" produced with MergeBamAlignment takes as unmapped bam the initial bam file and not the markilluminaadapters bam. It could be useful to preserve through any following steps the information about reads containing adapters.
Thanks,
By the way excellent "how to", very useful, thanks!!!

#### Issue · Github May 2016 by Sheila

Issue Number
949
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

@fabiodp --Yes, certainly if you will use the XT tag downstream, by all means use the output of MarkIlluminaAdapters as the uBAM in the MergeBamAlignment step. Thank you for the complement.

• AustraliaMember Posts: 17

Hi,

Thanks for the great tutorial. I have a question: I'm trying to reproduce your tutorial results with the provided data. I ran into the problem that my resulting merged bam includes reads that are marked with the "XT" tag. However, a diff with your files show that your marked_adapters.bam does include marked reads but your merged final result file "6483_snippet_piped.bam" does not. But shouldn't there still be marked reads in the result file after merging with the bwa things? I thought the "CLIP_ADAPTERS=false" stops picard from hard clipping reads and not soft clipping?

Cheers,
Tim

edited July 2016

Hi Tim (@timk),

Yes the workflow in the tutorial uses the BAM from the RevertSam step as the unmapped BAM and so we do not expect to carry-over the XT tag. However, as I replied to @fabiodp, you can also use the BAM from the MarkIlluminaAdapters step as the unmapped BAM so as to carry-over the XT tag. You'll have to add ATTRIBUTES_TO_RETAIN=XT to the MergeBamAlignment command. I agree this tag will be useful, e.g. in interpreting deviant reads arising from 3' adapter sequence in a pileup.

You bring up a good point with CLIP_ADAPTERS=false option. This MergeBamAlignment option, if set to true, only soft-clips the adapter sequence defined by the XT tag. There is no option to hard-clip adapter sequence with the Picard tool. Because downstream steps in our pre-processing workflows, e.g. MarkDuplicates or HaplotypeCaller, still use soft-clipped bases, whether or not adapter sequences are soft-clipped, results are the same in that any adapter sequence is included in consideration. Remember the MergeBamAlignment step changes BWA hard-clips to soft-clips in the merging so all types of softclips are present in the data. Presumably, reads with additional adapter sequences will be minor and these reads will mostly be discounted from contributing to variant calls in that HaplotypeCaller requires support for paths in its haplotypes graph from at least two supporting reads.

As far as I know, if you want to hard-clip these adapter sequences, you will have to do this with an outside tool or with your own script and use this hard-clipped file as the unmapped BAM. The MarkIlluminaAdapter step allows you to detect the extent of adapter sequences so you can either (i) continue with the workflow knowing the extend of 3' adapters is negligible for your dataset, (ii) go and clip moderate 3' adapters to start the workflow anew with a trimmed dataset or (iii) decide to remake your library or modify your library prep protocol because the sequences derived from the extensive small insert sizes make for a less informative dataset.

Post edited by shlee on
• AustraliaMember Posts: 17

Hi Shlee,

Sorry for reposting and thanks for the extended answer.

Cheers,
Tim

• Beijing Institute of Genomics, CASMember Posts: 111

@shlee
Hi,
i tried to reproduce results with same command line, starting from same input tutorial file "6483_snippet.bam" and reference sequence. After RevertSam -> MarkilluminaAdapters, piping and/or by following the steps (FASTqtoSam->BWA-MEM->MergeBamAlignments) separately, for both instances the number of alignment records and the number of unmapped reads are same.

INFO    2016-07-13 17:51:49 AbstractAlignmentMerger Wrote 275774 alignment records and 1198 unmapped reads.


However, these counts are different than above mentioned one. i am using bwa-0.7.12-r1039 and picards-2.2.4.

I guess this might be due to change in bwa software version, What you say?
could you please explain how you count the number of reads for RevertSam/MarkilluminaAdapters or in intermediate sam/bam(s) ? I wonder, would it be a software version difference or somewhere else..

Thanks!

#### Issue · Github July 2016 by Sheila

Issue Number
1080
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee
edited July 2016

Yes, the tutorial uses BWA v 0.7.7.r441, which the Broad Genomics Platform has used in their production workflows. This is a different version of BWA than yours. It's possible this difference is the source of the different number of alignment records, but I cannot say for sure.

It's unclear to me what you mean by "for both instances the number of alignment records and the number of unmapped reads are same" given RevertSam/MarkIlluminaAdapters doesn't have aligned records.

You can count number of records with samtools view -c. I hope you satisfy your curiosity and are able to answer your own questions.

• Beijing Institute of Genomics, CASMember Posts: 111
edited July 2016

@shlee
Hi,

i mean for "both instances" are the results that came out with "PIPING" and running the scripts altogether (instance 1) and result that came out by running the scripts (SamtoFastq, Bwa-mem, MergeBamAlignments) separately (instance 2).

For these both instances, the number of mapped and unmapped reads were matched (but differs with your results) , i.e.
INFO 2016-07-13 17:51:49 AbstractAlignmentMerger Wrote 275774 alignment records and 1198 unmapped reads

On the other hand, i checked the read counts for my own generated RevertSam and MarkIlluminaAdapters files, they were the same as mentioned above (275546 all records including XT, and 448 records that only marked with XT). bwa-mem read-mapping is the only difference that leads to different number of alignment records. If this is the case then i think using recent bwa version is more beneficial.

• Beijing Institute of Genomics, CASMember Posts: 111

@shlee
Hi,
Additionally, i am also curious about one coordinate-sorted file provided in tutorial's bundle ("/tutorial_6483_intermediate_files/6483_snippet_bwa_mem_coordinate.bam"):

As we have both "6483_snippet_revertsam.bam" and "6483_snippet_bwa_mem.sam" files are sorted by 'queryname' that would be sufficient for running MergeBamAlignments, So i wonder at which step and with what purpose this file ("/tutorial_6483_intermediate_files/6483_snippet_bwa_mem_coordinate.bam") was generated ?

Thanks!

@MUHAMMADSOHAILRAZA That's great you found the reason for the difference is the BWA version. And yes, if you have a choice per experiment set, it is always good to use the latest tool version that offers the features you need.

I included the coordinate sorted alignment file for those interested in dissecting differences before and after MergeBamAlignment, e.g. using IGV to recapitulate the view shown in the IGV screenshot I include in the tutorial. I do not document how to generate this coordinate-sorted file, but rather I just included the file. If you need to know, you can generate the same file with a Picard SortSam conversion.

• Beijing Institute of Genomics, CASMember Posts: 111
edited July 2016

@shlee

Hi again,

I downloaded aligned BAM files per sample for already published genomes of human populations. In Header section i saw RG record something like this:

              @RG     ID:LP6005441-DNA_A09    SM:LP6005441-DNA_A09


having information for only RGID and RGSM. But to reproduce the results with GATK best practices, I think i have to replace the read group (RG) information.

1. At what point in your opinion it's best to replace RG information, before starting the whole process cleanup "unmapping the aligned BAMs" or "after following cleaning BAMs after MergeBamAlignments" ?

Secondly, by checking the reads information in a BAM file, it appears that the data sequenced on multiple lanes, i.e.:

                HS2000-630_102:4:2115:1889:70619
HS2000-630_102:3:2311:13151:38215
HS2000-630_102:2:2315:18670:41735

1. Is it okay to assign the same RGIDs (As i mentioned earlier that authors did!) for the reads run on multiple lanes? could you please suggest any way to clean up this type of BAMs to deal with such situation, so i can Markduplicates and run BQSR procedures correctly.

Thanks!

As you surmise, our tools require more information for the read group field than what is present in your downloaded BAM files. These are specified in the Dictionary entry #6472 on read groups here. Your unaligned BAM, the one you use in MergeBamAlignment, should have the correct read group information and read group tags for each read. Otherwise, you won't be able to proceed with using our tools. You can use AddOrReplaceReadGroups to accomplish this.

Ideally, you should represent each lane with a unique read group ID and appropriate Library and Sample identifiers, e.g. the same for these if they are indeed the same library prep and sample. Pooling different lanes of the same sample with the same RGID is NOT good practice. I could go on but I think your line of questioning shows you already understand the implications and are only looking for some confirmation. You'll find the same answer across our forum if you search using your question.

• Beijing Institute of Genomics, CASMember Posts: 111

@shlee

Hi, Is there any easy way to split the merged BAM to multiple BAMs with respect to Flowcell lanes?

Thanks!

• Beijing Institute of Genomics, CASMember Posts: 111

i cannot access my profile properly or edit my comments and also cannot post a new question, please check my profile as well.

Thanks!

edited July 2016

@MUHAMMADSOHAILRAZA, Our tools are read group aware and so they will process your multi-readgroup merged sample level BAM appropriately. If there is some reason you still need to subset reads, then I am aware of two tools that can do this. First is Picard's RevertSam; use the OUTPUT_BY_READGROUP option. Second is GATK's PrintReads; you can exclude/include reads by --platform, --readGroup or --sample_file or --sample_name.

I'll get back to you about your issues accessing your account. If you can do your bit and try again later and let us know if you're still having issues, we can rule out some possibilities.

• Beijing Institute of Genomics, CASMember Posts: 111

@shlee

Hi,
Thanks for response! i am aware of these options and tools that you mentioned above, but i was looking for something that can edit read group information (especially IDs) if we have same RG tag for multiple lanes merged in single BAM something like this:

          @RG     ID:LP6005441-DNA_A09    SM:LP6005441-DNA_A09


So i thought the possible solution might be:
1. Split the BAM file into multiple BAMs per lane > then Assign correctly or replace read groups
2. Or just to replace the read groups in BAM by just reading the lane information (directly in one run).

Thanks!

Geraldine Van der Auwera, PhD

• Beijing Institute of Genomics, CASMember Posts: 111
edited July 2016

@shlee

Hi again,

In case of large BAM files (i.e. per sample per bAM) approximately 100Gb, are the following additional optimizing java jar's parameters are sufficient while piping the scripts together at step 3 (above)?

 java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...


I don't know so much about these parameters,
-Xmx5000m, -Dsamjdk.buffer_size, -XX:GCTimeLimit, -XX:GCHeapFreeLimit, -Dsamjdk.use_async_io=true, -XX:+UseStringCache -

i try to learn about them from internet but not clear about how to set these parameters for large data sets. I know this is not GATK related question, but kindly could you please explain a bit in detail how to use them with one-line comprehensive definition, as you guys routinely use them on large HTS data sets at Broad institute.

Thank you very much!

That question is outside the scope of support we provide. We are here to help people understand how GATK tools work and how to apply them correctly to answer research questions. We are not here to act as general scientific computing consultants.

Please understand that we have a lot of people asking for our help, so we need to prioritize our efforts. You have been asking an unusually large number of questions, so now it is time for you to figure things out for yourself. For the coming weeks, I expect you to refrain from posting questions unless it is something directly related to GATK tools, you have really tried to find the answer in the documentation, and it is not something you can find out by doing some test runs.

Geraldine Van der Auwera, PhD

• Beijing Institute of Genomics, CASMember Posts: 111
edited July 2016

@Geraldine_VdAuwera
Hi,
I just asked a question about the line of java jar's arguments that are suggested above in this documentation page. I know these questions not directly linked with GATK's tools but relevant with smooth running of Picards tool that is going to integrate with GATK (as far i know) in future.
Secondly, whenevr i post these question related to java parameters on internet, the response always comes in return "it depends the tools that you are using with underlying data set" so actually it was not merely as a general scientific consultation.

as i am new in this so thank you very much for your kind patience and responding towards my all queries even when they are not worth answering might be . I know the GATK team is busy. i never posted a question of which answer i can get anywhere else...

Anyway i will take care next time.

Thanks!

• South KoreaMember Posts: 3
edited September 2016

@shlee

Your method is excellent, but the error message at MergeBamAlignment was shown and have not solved up to now.
I tried to solve this using applying FixMateInformation, SortSam, CleanSam independently and together.
Not all the samples make this error.

I would very much appreciate If you help me to solve the problem.

The error seems about "Requesting earlier reference sequence"

java -Xmx10g -Djava.io.tmpdir=tmp1 -jar picard.jar MerageBamAlignment R=reference.fa UNMAPPED_BAM=sample.ubam ALIGNED_BAM=sample.sam O=sample.bam CREATE_INDEX=true ADD_MATE_CIGAR=true CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS VALIDATION_STRINGENCY=SILENT TMP_DIR=tmp2

Exception in thread "main" htsjdk.samtools.SAMException: Requesting earlier reference sequence: 0 < 5
at htsjdk.samtools.reference.ReferenceSequenceFileWalker.get(ReferenceSequenceFileWalker.java:82)
at picard.sam.AbstractAlignmentMerger.calculateMdAndNmTags(AbstractAlignmentMerger.java:596)
at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:390)
at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:157)
at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:266)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

@Kyuwon
Hi,

Can you try re-sorting your BAM file? Please make sure you re-generate the index file too.

Thanks,
Sheila

• USAMember Posts: 7

Hi,

I have a quick question regarding this tutorial. I apologize in advance if it is a trivial/obvious question. In section 3C, you mention that the inputs to MergeBamAlignment, uBAM and alignedBAM should be sorted. The alignedBAM is sorted when it is created, so I don't need to worry about that. But the uBAM, do I have to sort it separately before-hand. I mean, I want to use the MarkIlluminaAdapters. So do I have to sort my original uBAM file before feeding it as an input to MarkIlluminaAdapters. Or does MarkIlluminaAdapters in addition to adding tags, sorts the output as well. The reason I ask this question is, I plan to use the piped command directly, with the output of MarkIlluminaAdapters as the input. So do I need to worry about sorting the original uBAM before the pipeline starts?

Thanks

#### Issue · Github January 18 by Sheila

Issue Number
1647
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee
• USAMember Posts: 7

@slee

Sorry for re-posting this. I have a quick question regarding this tutorial. I apologize in advance if it is a trivial/obvious question. In section 3C, you mention that the inputs to MergeBamAlignment, uBAM and alignedBAM should be sorted. The alignedBAM is sorted when it is created, so I don't need to worry about that. But the uBAM, do I have to sort it separately before-hand. I mean, I want to use the MarkIlluminaAdapters. So do I have to sort my original uBAM file before feeding it as an input to MarkIlluminaAdapters. Or does MarkIlluminaAdapters in addition to adding tags, sorts the output as well. The reason I ask this question is, I plan to use the piped command directly, with the output of MarkIlluminaAdapters as the input. So do I need to worry about sorting the original uBAM before the pipeline starts?

Thanks

• Member, Broadie, Dev Posts: 20

Hi @abhishek_maj08, I think you meant to tag @shlee!

Thanks for that @slee.

In section 3C, the sorting required is query-group-sorted, which can be a slightly different order from query-name-sorted, but can be used interchangeably for most contexts. This is the order that paired-end (PE) reads are in when they are interleaved in section 1. This is also the ordering that MarkIlluminaAdapters expects and maintains in its output. BWA outputs query-group-sorted reads. For PE data, you want to be certain you provide a query-name/group-sorted file to BWA so as not to bias calculations of mean insert size that can arise from providing files containing vestiges of previous alignments, i.e. sort orders based on previous alignment coordinates.

So through all the steps of the workflow, you do not perform any active sorting. Again, no need to worry about sorting.

edited January 19

1. See this github doc I wrote that uses a newer release of Picard and this Picard codebase issue for additional descriptions of MergeBamAlignment features, including a new tag, the FT tag, that the tool adds to unmapped contaminant reads when UNMAP_CONTAMINANT_READS is set to true.

• The UNMAP_CONTAMINANT_READS=true option applies to likely cross-species contamination, e.g. bacterial contamination. MergeBamAlignment identifies reads that are (i) softclipped on both ends and (ii) map with less than 32 basepairs as contaminant (change with MIN_UNCLIPPED_BASES option). For a similar feature in GATK, see OverclippedReadFilter. If MergeBamAlignment determines a read is contaminant, then the mate is also considered contaminant. MergeBamAlignment unmaps the pair of reads by (i) setting the 0x4 flag bit, (ii) replacing column 3's contig name with an asterisk *, (iii) replacing columns 4 and 5 (POS and MAPQ) with zeros, and (iv) adding the FT tag to indicate the reason for unmapping the read, e.g. FT:Z:Cross-species contamination. The records retain their CIGAR strings. Note other processes also use the FT tag, e.g. to indicate reasons for setting the QCFAIL 0x200 flag bit, and will use different tag descriptions.
2. You can now also download the tutorial example data via FTP from ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/.

3. Newer versions of MergeBamAlignment going forward will now copy over AH tags from @SQ header lines if present, and allows the user to specify which header to use in the merged output--that from the uBAM or that from the aligned BAM. See Tutorial#8017 and this Picard codebase issue for a description of the AH tag.

• USAMember Posts: 7

@shlee
Thank you for your response. I should have mentioned that I start off directly with uBAMs. So basically from Step 2. That is why I did sort them before using them as input to MarkIlluminaAdapters. Is that okay?

Did you queryname-sort? Because that is the sort you should be sure to do if you are sorting at this stage from a sort order that used to be coordinate. Besides the reasons I mention above, this also ensures the pairs are interleaved, i.e. next to each other.

• USAMember Posts: 7

@shlee Yes, I sorted by queryname.