The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

Picard 2.10.2 is now available at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.2 (i.e. the second beta release) is out. See the GATK4 BETA page for download and details.

# (How to) Generate an unmapped BAM from FASTQ or aligned BAM

edited June 23

Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

(A) Convert FASTQ to uBAM and add read group information using FastqToSam
(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

#### Prerequisites

• Installed Picard tools

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and to 10:91,000,000-92,000,000.

### (A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM, requires two read group fields and makes optional specification of other read group fields. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

java -Xmx8G -jar picard.jar FastqToSam \
FASTQ=6484_snippet_1.fastq \ #first read file of pair
FASTQ2=6484_snippet_2.fastq \ #second read file of pair
OUTPUT=6484_snippet_fastqtosam.bam \
READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
SAMPLE_NAME=NA12878 \ #required
LIBRARY_NAME=Solexa-272222 \ #required
PLATFORM_UNIT=H0164ALXX140820.2 \
PLATFORM=illumina \ #recommended
SEQUENCING_CENTER=BI \
RUN_DATE=2014-08-20T00:00:00-0400


Some details on select parameters:

• For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.
• For single ended reads, specify the input file with FASTQ.
• QUALITY_FORMAT is detected automatically if unspecified.
• SORT_ORDER by default is queryname.
• PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
• RUN_DATE is in Iso8601 date format.

Paired reads will have SAM flag values that reflect pairing and the fact that the reads are unmapped as shown in the example read pair below.

@H0164ALXX140820:2:1101:10003:49022/1
ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA
+
-FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ


@H0164ALXX140820:2:1101:10003:49022/2
TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG
+
AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F######################


After FastqToSam

H0164ALXX140820:2:1101:10003:49022      77      *       0       0       *       *       0       0       ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA -FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:49022      141     *       0       0       *       *       0       0       TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F###################### RG:Z:H0164.2


### (B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to cater the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting the entirety of a large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
I=6484_snippet.bam \
O=6484_snippet_revertsam.bam \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true #default


To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory


We invoke or change multiple RevertSam parameters to generate an unmapped BAM

• We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[1]++) { print }}'  For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags. • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval. • In this command, we've set MAX_DISCARD_FRACTION to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing. Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  Some comments on options kept at default: • SORT_ORDER=queryname For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed. • RESTORE_ORIGINAL_QUALITIES=true Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag. • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false. Below we show below a read pair before and after RevertSam from the tutorial data. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer--5' to 3' of the read being sequenced. For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads. Original BAM H0164ALXX140820:2:1101:10003:23460 83 10 91515318 60 151M = 91515130 -339 CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:0 MQ:i:0 OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA UQ:i:0 AS:i:151 H0164ALXX140820:2:1101:10003:23460 163 10 91515130 0 60M91S = 91515318 339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1; MC:Z:151M MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:2 MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### UQ:i:49 AS:i:50  After RevertSam H0164ALXX140820:2:1101:10003:23460 77 * 0 0 * * 0 0 TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2 H0164ALXX140820:2:1101:10003:23460 141 * 0 0 * * 0 0 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2  back to top Post edited by shlee on Tagged: ## Comments • Member edited April 2016 Hi, How to perform FastqToSam function in picardtools when the fastq files are not interleaved i.e. forward (R1) and reverse (R2) fastq files separately. Picardtools is not taking two FASTQ inputs. Error message: "Option 'FASTQ' cannot be specified more than once." • Broad InstituteMember, Broadie, Moderator @aneek Hi, If you look at the FastqToSam documentation, you will find your answer. -Sheila • Member edited April 2016 @Sheila Thank you very much. I have to use FASTQ2 for the 2nd read. Another thing I would like to ask is, whether there is any need to use FastQC and Trimmomatic to correct the quality of the raw data and remove the bad reads, if I use these steps of making a clean BAM from short read sequence recommended by GATK. Also can I use threading option (-t) in all GATK commands to increase the speed of the tasks. Thank you. • Broad InstituteMember, Broadie, Moderator @aneek Hi, No, it is best to stick to the Best Practices. There should be no need to correct for anything before doing the Best Practices pre-processing steps. You can find which tools accept -nt in the tool documentation. -Sheila • Member @Sheila Hi, Thank you very much for all the informations. • Beijing Institute of Genomics, CASMember edited July 2016 @Sheila Hi, I cannot download the example data set "tutorial_6484_RevertSam.tar.gz"/"tutorial_6484_FastqToSam.tar.gz" from above links.. In china, Because we don't have complete access to google, may be due to that reason i cannot download the data from google drive.. could you please make it available somewhere else? Like as you do for "bundle" as an "example" directory.. Thanks • CambridgeMember, Broadie, Moderator edited July 2016 I've uploaded the tutorial datasets to the ftp site. You can find the data for this and other tutorials at: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/. Like with the bundle, remember to keep the password field blank. • Beijing Institute of Genomics, CASMember Thanks! @shlee • Beijing Institute of Genomics, CASMember @shlee Hi, I am curious Why RG tag is absent for the first read of above "After RevertSam" results section? and could you please explain what does 77 and 141 values represent in column 2? Thanks • CambridgeMember, Broadie, Moderator You can check yourself why this may be with the command: samtools view 6484_snippet_revertsam.bam | grep 'H0164ALXX140820:2:1101:10003:23460'  Then you'll see that both reads should have an RG tag. I've updated the example--thanks for pointing this out. For an explanation of SAM flags, I refer you to the following: • NYMember edited September 2016 Hi, I tried to convert my paired-end fastqs to SAM using FastqToSam, but got the error: "In paired mode, read name 1 (HWI-D00377:30:H8EJDADXX:1:2209:8491:93586) does not match read name 2 (HWI-D00377:30:H8EJDADXX:1:2104:20024:30303)". Is there anyway to fix it? Thanks! #### Issue · Github September 2016 by Sheila Issue Number 1302 State closed Last Updated Assignee Array Milestone Array Closed By sooheelee • CambridgeMember, Broadie, Moderator This question has been answered here. • SwitzerlandMember Hi, I am using the script reported in paragraph "(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam" As suggested, I set the SANITIZE option as true (indeed, the process has problems already at the revertsam-->revertmark step when setting it as false). However, for some older bam files, I loose as much as half of the reads. Meaning that my interleaved fastq file as well as my remapped bam is much smaller in size and script number. Is there a way to circumvent such a big loss? Thanks! • CambridgeMember, Broadie, Moderator Hi @abor, When you say you lose up to half of the reads, do you mean alignment records or unique reads? I ask because an aligned BAM can represent the same read in multiple alignment records and the reversion to uBAM has the effect of paring these down to one unique read. Can you also describe further the problem you are running into? It would be helpful to us if you can post the tool error message. • SwitzerlandMember Hi, thanks. the err file reports: "INFO 2016-11-25 15:41:50 RevertSam Discarded 23263512 out of 61936806 (37.560%) reads in order to sanitize output." Consistently, the read count in my original bam file is 61,955,464, in my interleaved fastq is 38673294 and in my final recalibrated bam (after realignment/bqsr) is 38690072 • CambridgeMember, Broadie, Moderator edited November 2016 Ok, @abor. So I think the question becomes why do you want to save the discarded reads? Consider what property of the discarded reads caused them to be discarded given SANITIZE discards paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Then consider what impact these reads have on your analyses. E.g. do the remaining reads provide sufficient coverage or do you need all the coverage you can salvage? Do the discarded reads qualitatively differ, e.g. have lower base qualities, than the remaining reads? I think if they are qualitatively equivalent, then only the paired reads with missing mates would be worth salvaging. If this is the case, then we'd have to think about how to retain such reads to the exclusion of the other sanitized reads. So you would first RevertSam without sanitizing. Then, I believe you can filter records with mismatches in length of bases and qualities using GATK's MalformedReadFilter's optional parameter --filter_mismatching_base_and_quals. Because your BAM at this point is unaligned, and therefore not coordinate-sorted, you have to run PrintReads in --unsafe mode and discard the bai index that the tool outputs. As for duplicated records, a UNIX sort and uniq should remove these. For example, for your unaligned uBAM: samtools view ubam.bam | sort | uniq > ubam_uniquify.sam  Will print the unique reads to a SAM format file. You'll have to reattach your header before converted again to BAM format. Then: java -jar GenomeAnalysisTK.jar -T PrintReads \ -I uniquify_ubam.bam \ -o ubam_filter_mismatching_base_and_quals.bam \ -R your_reference.fasta \ --filter_mismatching_base_and_quals \ --unsafe  Also, take a look at the ValidateSamFile document. I hope this is helpful. Post edited by shlee on • SwitzerlandMember edited January 17 Hi @shlee, sorry for the long delay. I went through the steps you suggest: on my original BAM file I use RevertSam without SANITIZE, then use sort/uniq and then PrintReads with --filter_mismatching_base_and_quals --unsafe. The number of reads remains exactly the same I had in the original BAM; in addition, running ValidateSamFile MODE=SUMMARY gives me the following error: ERROR:MATE_NOT_FOUND, and the number of reads with this error is exactly the same number of reads that are discarded when I use RevertSam with SANITIZE=true. Turns out that the discarded reads are reads with missing mates (that were, however, regularly mapped in my original BAM), and that I would like to keep, given that they are as much as 15-45%of total reads, for some samples. However, the MarkIlluminaAdapters tool doesn't process the uBam file unless they have been discarded. Do you think there may be a solution for this step to work properly also in the presence of reads with missing pairs? Thanks Post edited by abor on • SwitzerlandMember edited January 17 Plus, I wonder why this happens. When using Bam files in which the RealignerTargetCreator and BaseRecalibrator steps are performed through the whole genome, I can pass the Bam files again through the pipeline (RevertSam, MarkIlluminaAdapters, SamToFasq, BWA, MarkDuplicates and GATK RealignerTargetCreator/BaseRecalibrator) several times without loosing any reads; instead, I notice that when using the -L option on the bait regions at the RealignerTargetCreator and BaseRecalibrator steps, reads are discarded if the BAMs are subsequently re-passed through RevertSam. Of course I don't need to re-run RevertSam several times, but I could need to do it in the future. So the question is whether the -L option at RealignerTargetCreator and BaseRecalibrator steps introduces any tag in the read so that it is then discarded as "missing mate" when passed again through RevertSam. Hope to have been clear enough. Thanks again for your help Post edited by abor on • Cambridge, MAMember, Administrator, Broadie Removing reads that cause validation errors is what SANITIZE=TRUE is for. If you don't want to remove them you need to disable that option -- but then some other tools may refuse to run on that data. Our supported pipelines all discard unpaired reads as a quality assurance measure. When you use -L in the preprocessing steps, do you use it also with IndelRealigner and PrintReads? • SwitzerlandMember edited January 17 Edited post: no, -L was used in RealignerTargetCreator, BaseRecalibrator and PrintReads, but not in IndelRealigner • SwitzerlandMember The question is why reads that were previously successfully mapped (they are mapped in the BAM file I use as input for RevertSam) are detected as unpaired reads and discarded • CambridgeMember, Broadie, Moderator edited January 20 Hi @abor, Since you mention that you are using older BAM files, I assume you don't have access to the original FASTQ files and the missing mates are forever gone. Is this true? It's my understanding that aligners such as BWA align each read independently, even ends of mated reads. That GATK tools have built in quality control features that are different from that of aligners is to a user's benefit. At the least, you become aware of the quality of the data you are working with and can come to a decision on how to proceed with the discrepant data. In answer to your question, Do you think there may be a solution for this step to work properly also in the presence of reads with missing pairs? my answer is yes, there is a solution. My understanding is that you wish to (i) process data through MarkIlluminaAdapters as well as (ii) keep the mate not found reads. Roughly, I think you can separate your paired reads and mate not found reads and process them as PE and SE reads separately, respectively. For example, because processing through MarkIlluminaAdapters only makes sense for PE reads (see Tutorial#6483 for why), you would process only the PE BAM through this step and skip this step for the SE BAM. Sorry, it appears MarkIlluminaAdapters has a parameter to process SE reads (MIN_MATCH_BASES_SE). To merge the reads together for joint processing for downstream steps, well, here's where you'd have to do some exploration of tools. I think you would use either MergeSamFiles or MergeBamAlignment. Tutorial#6483 gives details on MergeBamAlignment. It will take as input multiple ALIGNED_BAM files, so you can provide it both the aligned PE BAM and aligned SE BAM. I hope this is helpful. Let us know what ends up working for you and if you have additional questions. Post edited by shlee on • Member Hi all, I current have some RNA-Seq Tophat aligned files (no original fastq), both the Aligned bam and unmapped reads files and wanted to generate a fastq to align with another aligner (STAR or Salmon). Do you think I should try to merge the Aligned and unmapped files and then perform the conversion? I have a feeling that might not be correct because it would insert potential reads with no mate. Otherwise I'm a little stuck as to what to do with the unmapped file. Thanks for any help! Darryl #### Issue · Github January 19 by Sheila Issue Number 1648 State closed Last Updated Assignee Array Milestone Array Closed By sooheelee • Cambridge, MAMember, Administrator, Broadie @abor When you use -L with PrintReads, you're telling GATK to throw away any reads outside of those intervals. Then any reads inside the intervals that had mates outside become mate-not-founds. • CambridgeMember, Broadie, Moderator Hi Darryl ( @dnousome ), Whichever file contains the complete set of read pairs, you can use this particular file. STAR takes FASTQ files and I assume Salmon also takes FASTQ files. If the file you start with is the aligned BAM, you can revert using RevertSam, convert to FASTQ and then align with your aligner of choice. If it is the uBAM, then you can proceed with converting to FASTQ and then align. If you wish to pipe conversion to FASTQ and the alignment step, which allows you to avoid storing a FASTQ file, then checkout the workflow detailed in Tutorial#6483 for BWA alignment. I believe the concepts should still apply to your case. The tutorial then explains how and why you would use MergeBamAlignment. I hope this is helpful. • Member @shlee Most definitely! Thanks for your help. • CambridgeMember, Broadie, Moderator @abor, I've modified my last answer to you above. • CambridgeMember, Broadie, Moderator ## Update to RevertSam An issue described by this forum question has been fixed by a Picard repo code change documented here as a pull request and here as the related github issue. The effect of this code change is that when encountering mate-missing reads in a PE data file, RevertSam will now remove all mate information and thereby effectively turn mate-missing records into SE reads. • IRB BarcelonaMember Dear all, I have downloaded BAM files deposited in EGA from a study conducted some years ago. I do no have information of the "state" of the BAM file. My idea is to process them again and then do the somatic variant calling with Mutect2. I have checked a bit the BAM files and they are mapped and sorted by coordinates. Just to make sure, do I have to follow part B of this tutorial? If I directly convert the BAM files (without applying RevertSam) into fastq files do I have to unmapped the BAM file first and then convert them to fastq? Thanks! • CambridgeMember, Broadie, Moderator edited May 19 Hi @inesentis, There are many factors for you to consider, including: 1. How were the samples in the Panel of Normals (PoN) you will use preprocessed? 2. Do your BAM files contain multiple read groups? Remember that RevertSam can separate these out with the OUTPUT_BY_READGROUP option and that BQSR is performed at the lane level (different lanes must have different RG IDs). Alternatively, you could potentially queryname-sort your reads to identify the blocks that correspond to different read groups, if your BAMs follow the convention that read names correspond to the RG ID or RD PU fields. 3. How are the different readgroups organized? Is there a PU field that also distinguishes them? 4. Do your BAM files contain QC-failed reads (NON_PF marked by the 0x200 SAM flag)? You may consider removing these to ensure a conservative analysis. 5. Do your BAM files contain SE reads or mate-missing reads from a particular read group or mixed in with PE reads? 6. Are your tumor and normal pair samples named to differentiate them in the @RG SM field? MuTect2 requires these be different. A first step towards answering these questions is to run your BAMs through ValidateSamFile as well as samtools flagstat. If you will use MuTect2, if possible please be sure to use the GATK4 MuTect2, instead of the GATK3 MuTect2. Our developers say that GATK4's MuTect2 (in beta) is now ready for testing. It has improvements to GATK3's MuTect2. You can find the jar at https://github.com/broadinstitute/gatk-protected/releases/tag/1.0.0.0-alpha1.2.6. Remember that beta status means it is only for testing and not necessarily ready for production purposes. • IRB BarcelonaMember 1. I was not thinking of using a PoN since the data that I am downloading are tumor/normal pairs. Is it necessary? How do I built PoN for a study based on normal/tumor pair of samples? 2. The problem with this BAM files is that there are no different RG IDs. In other words, all the files (even form different patients and irrespective of tumor/normal) have the same RG ID. To me it seems that they have sequenced all in the same lane or maybe at some point they aggregate or collapse all the readgroups in the same BAM file and they lost the unique IDs. Not sure how to proceed in this case. 3. No RG PU tag found. 4. The QC -failed reads must have been removed since I do not have them when computing >samtools flagstat anybam.bam  1. I do have some singletons. Is there a threshold from which one can not worry or be worried about? 2. Yes! They do have different sample names • CambridgeMember, Broadie, Moderator edited May 22 Hi @inesentis, I highly recommend the use of a PoN for somatic variant calling. You can see how to make one in section 1 of the MuTect2 hands-on tutorial listed here. The tutorial's PoN, I made it myself using publicly available 1000 Genomes Project data. I used forty WES datasets to match my tumor-normal pair WES dataset. The MuTect2 tool documentation also tells you how to go about making a PoN. If you follow along the hands-on tutorial, using the provided example data, you see the value of using a PoN, albeit a minimal PoN made from 40 samples, in sections 5 and 6. I have WDL scripts for creating this PoN in the cloud that I can share with you. You'll have to make your own Docker image however as the Docker I use is my personal private Docker container and is unshareable. Even if there are no RGIDs, it may be possible to differentiate lane-level data from the read names themselves as I allude to in my previous post. This depends on if the read names are the original or if they were changed. Alternatively, you could potentially queryname-sort your reads to identify the blocks that correspond to different read groups, if your BAMs follow the convention that read names correspond to the RG ID or RD PU fields. In addition to the QC-failed reads that you say are already removed, you should see if reads from short inserts, where there is adaptor sequence read-through, still remain in the file. Depending on the extent of these, you may wish to remove them or proceed knowing there is a minimal amount of these. You can use Picard's MarkIlluminaAdapters for such metrics. Singletons can still inform your analyses. Just make sure the unmapped mate is present in the file as well. Otherwise, this may cause problems with some tools, e.g. MarkIlluminaAdapters. • Member I was successfully able to do this step for Tutorial #6483. However, when using RevertSam with the same settings, except with one of my own BAM files, I encountered the following error: "CIGAR M operator maps off end of reference" Is there a processing step I have to do to the BAM file before I use RevertSam to get a uBAM? • CambridgeMember, Broadie, Moderator edited June 29 Hi @jfiksel, That doesn't sound right. Sanitizing and reverting alignments to uBAM should not care about alignment information. We would want to keep these reads, since presumably we are reverting for fresh (and hopefully better) alignment results. Would you mind submitting a bug report with a snippet of your data that allows us to recapitulate the error you are seeing? This will really help us. Instructions are in Article#1894: https://software.broadinstitute.org/gatk/guide/article?id=1894. • Member Hi @shlee, Thanks for the response. I believe that I have uploaded the necessary files to the ftp and they should be located in jfiksel_revertsam_bug.zip. If they're not there, it would be great if you could link to how to upload files to the ftp server, since I have never used ftp file transfer before. Jacob #### Issue · Github June 29 by shlee Issue Number 2231 State closed Last Updated Assignee Array Milestone Array Closed By sooheelee • CambridgeMember, Broadie, Moderator Hi @jfiksel, I can recapitulate the error with your test data. I've requested that RevertSam, when REMOVE_ALIGNMENT_INFORMATION=true, not get hung up on odd alignments due to read filters. The cause of the error was that the CIGAR string did not correctly represent the beyond-end-of-reference alignment. For the particular read in the error message, instead of a 100M, the CIGAR should be 74M26S. To get you going to where you need to go, you can run CleanSam on your data before running RevertSam. CleanSam will soft-clip beyond-end-of-reference alignments and correct the CIGAR string. Thanks again for the test data. • Member Thanks for the quick response @shlee ! I successfully was able to complete these steps after first running CleanSam and then running AddOrReplaceReadGroups. I noticed that if I did this without adding in read groups (but after running CleanSam), I got a null pointer exception error. It would be great if you could keep me updated with the status of this issue, since running CleanSam can add a significant amount of processing time for some of the larger WGS files I'm working with. • CambridgeMember, Broadie, Moderator @jfiksel, you can track the status of this bug fix at https://github.com/broadinstitute/picard/issues/849. You can also comment in these issue tickets as Picard is an open-source repo. Depending on who picks up this work, someone here at the Broad or external, they may ask that we share your test data. Would you be okay with that? • Member Thanks @shlee . And yes, that is no problem. • San FranciscoMember edited July 21 Hi folks, We're running into a problem with Revert Sam. When we run: "java -Xmx8G -jar{PICARD_PATH}/picard.jar RevertSam I=${SAMPLE_NAME}.bam O=${TMP_DIR}/\${SAMPLE_NAME}_rev.bam
ATTRIBUTE_TO_CLEAR=XT ATTRIBUTE_TO_CLEAR=XN ATTRIBUTE_TO_CLEAR=AS ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP ATTRIBUTE_TO_CLEAR=X0 ATTRIBUTE_TO_CLEAR=AM ATTRIBUTE_TO_CLEAR=SM"

we are getting, one 2/22 files:

Running RevertSam
[Wed Jul 19 00:41:20 UTC 2017] picard.sam.RevertSam INPUT=R7495_N.bam OUTPUT=/TMP_R7495_N/R7495_N_rev.bam OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS] SANITIZE=false MAX_DISCARD_FRACTION=0.01 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jul 19 00:41:20 UTC 2017] Executing as root@135b4c4f0e37 on Linux 3.16.0-4-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11; Picard version: 2.9.2-SNAPSHOT
...
Exception in thread "main" picard.PicardException: Two reads with same name but not correctly marked as 1st/2nd of pair: C4JL4ACXX140706:6:1101:10015:69192
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

Any thoughts about what is going on, or how we can get around it? We aren't using SANITIZE=true because of its potentially destructive nature, but I would be fine throwing single reads out (or weird pairs, like this) if we could avoid this issue.

All help will be great!

Hi @dannykwells,

Any reason to suspect your BAM contains a mix of PE and SE reads? If so, try setting OUTPUT_BY_READGROUP to true. Or is this RNA data that have been split by Ns?

The error:

Exception in thread "main" picard.PicardException: Two reads with same name but not correctly marked as 1st/2nd of pair: C4JL4ACXX140706:6:1101:10015:69192

occurs when the tool encounters reads with the same name as would typically occur for mates or for secondary/supplementary alignments. The tool checks for first and second of pair flags and if the two reads lack each of these, then it throws an exception you see.

• San FranciscoMember

Hi,

These samples are DNA, and all were sequenced together (at the Broad, actually). 22 samples were done together, and 5 are throwing this error.

Any other thoughts on how to get around this?

@dannykwells, can you post the records? That is, can we take a look at, e.g. C4JL4ACXX140706:6:1101:10015:69192`?