The current GATK version is 3.6-0

# (howto) Map and mark duplicates

edited February 22 in Archive

#### Objective

Map the read data to the reference and mark duplicates.

#### Prerequisites

• This tutorial assumes adapter sequences have been removed.

#### Steps

1. Identify read group information
2. Generate a SAM file containing aligned reads
3. Convert to BAM, sort and mark duplicates

### 1. Identify read group information

The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.
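
As a quick sanity check (a sketch, not part of the original tutorial): you can confirm that a header carries an @RG record with a simple grep. Against a real BAM you would pipe `samtools view -H file.bam` into the same grep; here the header fragment is hard-coded so the check is self-contained.

```shell
# Count @RG lines in a (hard-coded) SAM header fragment.
# On a real BAM: samtools view -H file.bam | grep -c '^@RG'
rg_count=$(printf '@HD\tVN:1.5\n@RG\tID:group1\tSM:sample1\n' | grep -c '^@RG')
echo "$rg_count"   # prints 1
```

A result of 0 would mean the file has no read group information and GATK will refuse to run on it.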

#### Action

Compose the read group identifier in the following format:

@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1


where the \t stands for the tab character.
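
To see what the expanded tag will look like in the header, you can let printf interpret the backslash escapes (a throwaway check; `%b` turns each `\t` into a real tab, mimicking the header line bwa will write):

```shell
rg='@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1'
# %b expands \t into literal tabs; cut then splits on those tabs
printf '%b\n' "$rg" | cut -f1,3   # prints the @RG and SM fields
```

Note the single quotes: they keep `\t` as two literal characters for bwa to interpret, rather than letting the shell touch them.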

### 2. Generate a SAM file containing aligned reads

#### Action

Run the following BWA command, replacing <read group info> with the read group identifier you composed in the previous step:

bwa mem -M -R '<read group info>' -p reference.fa raw_reads.fq > aligned_reads.sam


The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).

#### Expected Result

This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.

Note that here we are using a command specific to paired-end data in an interleaved fastq file (read pairs together in the same file, with the forward read followed directly by its paired reverse read), which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-end reads, or paired-end reads in separate forward/reverse files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.

### 3. Convert to BAM, sort and mark duplicates

These initial pre-processing operations format the data to suit the requirements of the GATK tools.

#### Action

Run the following Picard command to sort the SAM file and convert it to BAM:

java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

#### Action

Run the following Picard command to mark duplicates:

java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt


#### Expected Result

This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.

#### Action

Run the following Picard command to index the BAM file:

java -jar picard.jar BuildBamIndex \
    INPUT=dedup_reads.bam


#### Expected Result

This creates an index file for the BAM file called dedup_reads.bai.
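
The three Picard steps above can be chained into a small script. This is a sketch, not part of the original tutorial: the run() helper only prints each command (a dry run), so you can review the sequence before swapping the echo for "$@" to execute it for real. File names follow the tutorial.

```shell
set -e
run() { echo "+ $*"; }   # dry-run helper; replace echo with "$@" to execute

run java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate
run java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt
run java -jar picard.jar BuildBamIndex INPUT=dedup_reads.bam
```

Each step reads the previous step's output, so a failure anywhere stops the chain thanks to set -e.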


Geraldine Van der Auwera, PhD


• Posts: 16Member

Hi,

Following this week's workshop, I have been trying the pipeline on my own data. Currently I am generating a SAM file containing aligned reads from paired-end data in separate forward/reverse read files:

bwa mem -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' human_g1k_v37.fasta sample1_1.fq sample1_2.fq > aligned_reads_sample1.sam

I see a lot of "skip orientation ... as there are not enough pairs" messages (please see below). Does this indicate something is not running correctly?

[M::main_mem] read 100000 sequences (10000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 41485, 1, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (122, 168, 239)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 473)
[M::mem_pestat] mean and std.dev: (186.71, 84.97)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 590)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::worker2@0] performed mate-SW for 58304 reads

Thanks

• Posts: 543Member, Dev ✭✭✭✭

Nope, looks right to me. For standard paired-end sequencing the vast majority of your reads should be in FR orientation. It would only be a problem if you had a lot of reads in the other orientations.

• Posts: 16Member

Thanks. I am now running Picard to mark duplicates using the command above:

java -jar MarkDuplicates.jar \

ERROR: Option 'METRICS_FILE' is required.

The error message is nice and clear. You need to provide it a metrics file. I suggest: M=dedup_reads.bam.metrics

• Posts: 16Member

Hi

@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1

I have 60 exome sequencing samples, and for each sample I need to modify the SM field accordingly (e.g. SM:sample1, SM:sample2, and so on). Do I also need to change the ID, LB and PU fields for each sample (make them unique), or are they the same for all samples?

For example:

Sample1
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
Sample2
@RG\tID:group2\tSM:sample2\tPL:illumina\tLB:lib2\tPU:unit2

OR

Sample1
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
Sample2
@RG\tID:group1\tSM:sample2\tPL:illumina\tLB:lib1\tPU:unit1

Thank you

You should use the information from the sequencing metadata delivered with your data. LB is the library identifier, and PU is the platform unit, which identifies the flowcell and lane rather than the machine itself. The ID is usually a unique identifier for each lane.

Geraldine Van der Auwera, PhD
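
The advice above can be scripted. A sketch, assuming (hypothetically) one library per sample and all samples on a shared lane, so only the sample-specific fields vary; adjust the ID/LB/PU patterns to your own metadata:

```shell
# Emit one read-group string per sample, with literal \t separators
# ready to paste into bwa mem -R '...'. Values here are placeholders.
make_rg() {
  printf '@RG\\tID:%s.lane1\\tSM:%s\\tPL:illumina\\tLB:lib_%s\\tPU:unit1\n' \
      "$1" "$1" "$1"
}
for sample in sample1 sample2; do
  make_rg "$sample"
done
```

The double backslash keeps \t literal in the output, since bwa (not the shell) should expand it into a tab.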

• Leuven / Gent (Belgium)Posts: 17Member

New question(s) here after watching the BroadE videos.

• Is MarkDuplicates with marking but not removing duplicates enough for HaplotypeCaller downstream processing, as suggested by the sample commands, or should one add the parameter to actually remove the duplicates?

• When removing duplicates, does this keep one prototype read in each duplicate group, and if yes, what quality score does it store (median, max, ...)?

• MarkDuplicates is sufficient; HC will automatically apply filters to ignore dupes.

• I forget what rule is followed to keep a prototype read; it might be random but I'm not certain. This might be documented in the Picard docs (since it's technically not a step done with GATK, we don't have docs on this).

Geraldine Van der Auwera, PhD

• University of PennsylvaniaPosts: 23Member

Greetings!

I was wondering what kinds of errors could cause the mark-duplicates step to output a duplicates.metrics file in which the histogram is nearly uniform? The input BAM file was already sorted.

Thank you!
Stephanie

Hi Stephanie,

Sorry, that's a question for the authors of Picard tools, not us.

By the way, I would recommend not communicating text information in MS Word documents. If you later need to use the information elsewhere, Word can mess up the formatting and cause further issues. Also, other people may not necessarily have a license for MS Word, and may not be able to open the file. Using pure text formats is much better practice for this kind of information.

Geraldine Van der Auwera, PhD

• 湖南长沙Posts: 5Member

when I run the following command to sort the SAM file and convert it to BAM:
java -jar /usr/picard-tools-1.105/SortSam.jar INPUT=sample.sam OUTPUt=sample.sort.bam SORT_ORDER=coordinate

The terminal output the following things:
[Fri Dec 20 17:15:05 CST 2013] net.sf.picard.sam.SortSam INPUT=sample.sam OUTPUT=sample.sort.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
Java HotSpot(TM) Server VM warning: You have loaded library /usr/picard-tools-1.105/libIntelDeflater.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
[Fri Dec 20 17:15:05 CST 2013] Executing as binlangman@binlangman-Lenovo-G450 on Linux 3.2.0-57-generic i386; Java HotSpot(TM) Server VM 1.7.0_45-b18; Picard version: 1.105(1632) JdkDeflater
[Fri Dec 20 17:17:15 CST 2013] net.sf.picard.sam.SortSam done. Elapsed time: 2.16 minutes.
Runtime.totalMemory()=278396928
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
at java.util.ArrayList.iterator(ArrayList.java:814)
at java.util.Collections$UnmodifiableCollection$1.(Collections.java:1064)
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1063)
at net.sf.samtools.SAMRecord.validateCigar(SAMRecord.java:1436)
at net.sf.samtools.SAMRecord.getCigar(SAMRecord.java:606)
at net.sf.samtools.SAMRecord.getCigarLength(SAMRecord.java:617)
at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1599)
at net.sf.samtools.SAMLineParser.parseLine(SAMLineParser.java:328)
at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:237)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:225)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:201)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:687)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:665)
at net.sf.picard.sam.SortSam.doWork(SortSam.java:68)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:179)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:120)
at net.sf.picard.sam.SortSam.main(SortSam.java:57)

And the BAM file sample.sort.bam is empty; it doesn't contain anything, so I cannot finish the subsequent analysis. Why did this occur, and what should I do?

Thanks!

Hi @binlangman,

This is an issue with Picard tools, not GATK. Please see the link included in the error message for getting help from the Picard team.

Geraldine Van der Auwera, PhD

• 湖南长沙Posts: 5Member

Thanks!

This is my command line:
bwa mem -M -R ’@RG\tID:cancer\tSM:cancer\tPL:illumina\tLB:cancer\tPU:unit1’ -p human_g1k_v37.fasta cancer_fwd.fastq cancer_rev.fastq -M > cancermem.sam

and I got this error:
[E::bwa_set_rg] the read group line is not started with @RG

Any idea what is the problem here?

Ok I fixed my problem...
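
For anyone else hitting `[E::bwa_set_rg] the read group line is not started with @RG`: note the command above used typographic quotes (’…’) rather than plain ASCII single quotes. This is my reading of the thread, since the poster did not say what the fix was, but the shell does not treat ’ as a quote character, so it becomes part of the -R argument and the string no longer begins with @RG. A quick demonstration of the difference:

```shell
good='@RG\tID:group1'     # plain ASCII quotes: stripped by the shell
bad=’@RG\tID:group1’      # typographic quotes: kept as literal characters
case $bad in
  @RG*) echo "bad starts with @RG" ;;
  *)    echo "bad does NOT start with @RG" ;;   # this branch is taken
esac
case $good in
  @RG*) echo "good starts with @RG" ;;
esac
```

Retyping the command with plain quotes (or double quotes, as a later poster did) makes the error go away.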

• Posts: 7Member

Hi,

I am using the GATK pipeline for pre-processing BAM files after alignment with bwa mem. The original BAM file after alignment shows (samtools flagstat) 173,460,757 reads (this is deep-sequencing exome data captured with Agilent SureSelect 50 Mb).

But after removing duplicates with Picard, I am left with 14,651,238 reads! That's a mere ~20X coverage.

1. I would like to know whether it is normal in exome-seq to find such a huge number of duplicates. Some threads on other forums also say it's not wise to remove duplicates from deep sequencing data.

2. What is the difference between marking duplicates and removing duplicates? I know marking adds a flag instead of completely removing the read. But if the duplicate-marked reads are not used in any of the downstream steps (like SNP calling), why do we simply mark them instead of removing them?

3. While calculating coverage, should we consider duplicate reads as well (original BAM), or use the final BAM file with duplicates removed?

Thank you.

Hi @anand‌_mt,

1. No that's not normal. Did you run MarkDuplicates on all the data together or per lane of data?

2. Basically it's because we don't like to discard data entirely. We prefer to leave them there just in case.

3. Effective coverage is the coverage after removing duplicates because that's what will be used for analysis, but it's a good idea to look at both before & after, and do a comparison.

Geraldine Van der Auwera, PhD
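
A back-of-the-envelope check of the numbers reported above (173,460,757 reads before deduplication, 14,651,238 after): only about 8% of reads survive, i.e. roughly 92% were flagged as duplicates, which matches the reply that this is not normal.

```shell
total=173460757
remaining=14651238
# Integer percentage of reads that survive duplicate removal
echo $(( remaining * 100 / total ))   # prints 8
```

A per-lane versus pooled MarkDuplicates run, as asked about above, can produce very different rates, so this kind of quick arithmetic is worth doing before and after.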

• New York NYPosts: 1Member

I usually use samtools rmdup, but testing out MarkDuplicates.jar for the first time.
I am running this on 30x genome data, and everything appears to be going OK, except it has been stuck at the "Traversing read pair information and detecting duplicates" stage for over 48 hours. Is that expected? I am using a machine with 64 1.4 GHz cores and 512GB RAM (400 GB allocated to MarkDuplicates using -Xmx). Has it crashed, or should I just wait patiently for it to finish?

My impression is that samtools rmdup is much faster, are there any downsides to using it instead?

Well, MarkDuplicates can be very slow. Maybe check the process table on your machine to see if it's still moving at all. If that doesn't give you a clear answer, you may have better luck asking the Picard tools support folks (MarkDupes is not our tool, so there's not that much we can say about it).

Geraldine Van der Auwera, PhD

• Posts: 253Member ✭✭✭

More of a samtools forum question, but it used to be that samtools rmdup did not do chimeric-molecule duplicate marking while MarkDuplicates.jar did... that is the only difference I can remember. I have no idea if that is still the case. Marking duplicates on non-PCR-free WGS libraries always seemed to stink IIRC (it takes a long time and uses lots of RAM).

FYI we have a new implementation of MarkDupes in development that runs waaaaay faster. It will be a while before we can make it available, but just so you know, there's hope for the future.

Geraldine Van der Auwera, PhD

• Posts: 253Member ✭✭✭

Yay!

• BrazilPosts: 3Member

Good Morning,

I have a question regarding the alignment of reads. These reads previously went through a cleanup with Trimmomatic, which resulted in two paired-end files and two single-end files. How should I proceed in this case? Make an alignment for each single-end file and then merge with SAMtools?

I thank you and apologize for the difficulty with English.
Have a nice day,
Rhewter.

@Rhewter‌

Hi Rhewter,

I think your idea of making an alignment for each single-end and merging with SAMtools will work.

Good luck!

-Sheila

• United StatesPosts: 28Member

Thank you for this helpful tutorial! Is there any way to fill in the RG fields directly by reading from the FASTQ header containing information about flow cell, sample and library?
Thank you!

@Katie‌

Hi,

Unfortunately, there is no way to do that unless you write your own script.

-Sheila

• New York CityPosts: 2Member
edited December 2014

Hi,

I am using the MarkDuplicates tool to remove PCR duplicates from ATAC-Seq data. I was wondering if you found the answer to @splaisan 's question from Oct 2013 - "when removing the duplicates, does this keep one prototype read in each duplicate group and if yes what quality score does it store (median, max ...)". I just want to be sure it keeps one prototype regardless of the criteria for selection.

The rmdup tool from samtools keeps the read with the best mapping quality and I was wondering how MarkDuplicates deals with this.

Ramya

Hi @ramya88‌ I'm not sure; you should ask this on the Picard tools mailing list since it is not our tool.

Geraldine Van der Auwera, PhD

• New York CityPosts: 2Member

Will do - thanks!

• pakistanPosts: 9Member

HI all
Under "Identify read group information", this line is not working for me: @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
The terminal shows "command not found". Can somebody explain the best way to complete this tutorial? I have downloaded the sample with reference from bio planet.

Hi,

The line you posted is not a command. It is the line in the sam file that contains the read group and sample information.

Take a look under 7. How can I tell if my BAM file has read group and sample information?

-Sheila

• pakistanPosts: 9Member

I have generated aligned_reads_sample1.sam but I don't know what to do with it. I have the raw file and the reference file, and I want to use them via GATK. Can somebody guide me through the proper steps? I am very new here.

• pakistanPosts: 9Member

Whenever I run this command: java -jar SortSam.jar \ INPUT=aligned_read.sam \ OUTPUT = sorted_read.bam \ SORT_ORDER=coordinate
it gives me: Error: Unable to access jarfile SortSam.jar.
What am I doing wrong?

Hi,

You have not specified the proper path to the SortSam.jar. You need to specify where exactly the SortSam.jar is on your computer. This tutorial may help (under SAMtools): http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices#latest

-Sheila

• ITMAT Bioinformatics, University of PennsylvaniaPosts: 1Member

Hi,

I saw that using a reference from the bundle is recommended as a best practice. In the hg19 bundle, the reference includes haplotype chromosomes (chr6_apd_hap1) and unplaced chromosomes (chrUn_gl000211). Is it a reasonable workflow to exclude these from the reference? Thus, I would align to chrM,1-22,X and Y and then continue with the Best Practice workflow for DNAseq.

Thank you,
Faith

We recommend including the non-canonical contigs when you are doing the initial mapping to the reference. Afterward you can exclude them when you are running the Best Practices workflow.

Geraldine Van der Auwera, PhD

• pakistanPosts: 9Member

Hello All,

I am working on this step; here is my command:
bwa mem -M -R ’@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 ’ -p ucsc.hg19.fasta gcat_set_037_2.fastq > aligned_reads.sam
It generates a corrupt SAM file, and the terminal gives me this error:

[E::bwa_set_rg] the read group line is not started with @RG

Thanks

• pakistanPosts: 9Member
edited February 2015

bwa mem -M -R "@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1" -p ucsc.hg19.fasta gcat_set_037_2.fastq > aligned_reads.sam
After adding double quotes, I am getting:
[bwt_restore_sa] fail to open file 'ucsc.hg19.fasta.sa' : No such file or directory
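
A note for readers hitting the same `fail to open file 'ucsc.hg19.fasta.sa'` message: the .sa file is one of the companion files that `bwa index` writes next to the reference, so this usually means the reference was never indexed (my inference from the filename; the thread itself only points to BWA support). Running `bwa index ucsc.hg19.fasta` first should produce the expected set:

```shell
# The standard BWT index files bwa mem looks for beside the reference:
ref=ucsc.hg19.fasta
for ext in amb ann bwt pac sa; do
  echo "$ref.$ext"
done
```

If any of these five files is missing, bwa mem cannot load the index and fails with an error like the one above.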

• pakistanPosts: 9Member

Can somebody give me a clue regarding my problem above, please!

@sadiq For a problem like that you need to ask for support from the authors of BWA, not us, since it is not our tool. Or ask on SeqAnswers

Geraldine Van der Auwera, PhD

• ChinaPosts: 5Member

I've run into a problem while running Picard.
After I generated the SAM file and passed it to Picard, it said:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File aligned_reads5.sam; Line 1

at htsjdk.samtools.SAMLineParser.reportFatalErrorParsingLine(SAMLineParser.java:427)

• ChinaPosts: 5Member
edited May 2015

The head of the sam file goes like this:

[M::process] read 111112 sequences (10000080 bp)...
[M::process] read 111112 sequences (10000080 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (4, 42523, 0, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (158, 199, 255)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 449)
[M::mem_pestat] mean and std.dev: (211.41, 71.12)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 546)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 111112 reads in 33.378 CPU sec, 33.243 real sec
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276

the command I enter is :
nohup bwa mem -M ucsc.hg19.fasta pe_1.fq pe_2.fq > aligned_reads6.sam

@WenhsunWu Your SAM file is contaminated by log messages. There must be an error in the script you are using (probably an improper use of pipes). You must correct your script or run the mapping step manually to produce a valid SAM file.

Geraldine Van der Auwera, PhD

• ChinaPosts: 5Member

@Geraldine_VdAuwera
I've followed the steps strictly one by one and tried many times but it just didn't work.
So is that a problem with BWA itself? Can I simply delete the beginning rows?

@WenhsunWu Can you describe how you run the tools, and what command line you used?

Geraldine Van der Auwera, PhD

• ChinaPosts: 5Member

@Geraldine_VdAuwera
Yes.
bwa index -a ucsc.hg19.fasta

samtools faidx ucsc.hg19.fasta

java -jar picard.jar CreateSequenceDictionary \
    REFERENCE=/home/zyw/ucsc.hg19.fasta \
    OUTPUT=/home/zyw/ucsc.hg19.dict

nohup bwa mem -M -R '@RG\tID:Group1\tSM:Sample1\tPL:Illumina\tLB:Lib1\tPU:Unit1' ucsc.hg19.fasta pe_1.fq pe_2.fq > aligned_reads.sam



I ran all this on Debian 3.2.65-1
The bwa version I used is 0.7.12-r1039
samtools version is 0.1.18 (r982:295)
picard version is 1.131

• Posts: 253Member ✭✭✭

@WenhsunWu

Try running the bwa mem command without using nohup. As @Geraldine_VdAuwera says, your SAM file is being contaminated with bwa's logging messages (stderr and stdout are both being redirected to the file). Your command says to redirect stdout to the file (which is what you are supposed to do), but I am wondering whether the reason stderr is going to the file as well is that you are nohupping the command.

Yeah that would be my guess too.

Geraldine Van der Auwera, PhD
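
Whatever the exact cause, a robust fix is to give stderr its own destination so log lines can never land in the SAM, e.g. `nohup bwa mem ... > aligned_reads.sam 2> bwa.log`. A minimal demonstration of the redirection, using stand-in echo commands rather than real bwa output:

```shell
tmp=$(mktemp -d)
# Stand-in for "bwa mem": one SAM-like line on stdout, one log line on stderr
{ printf '@SQ\tSN:chr1\tLN:100\n'
  echo '[M::mem_pestat] log message' 1>&2
} > "$tmp/aligned.sam" 2> "$tmp/bwa.log"
cat "$tmp/aligned.sam"   # the log line went to bwa.log, not here
```

With the 2> redirection in place, nohup has nothing left to merge into stdout, so the SAM stays clean.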

• ChinaPosts: 5Member

@Kurt
@Geraldine_VdAuwera
Yeah, you two are right! I've figured out how to do it. Thank you!

• Posts: 76Member
edited August 2015

Hi,

I have a question about the read group identifier. Is it essential to keep unique read group identifiers for each sample when I do analysis with GATK?

Usually I use something like this:

sample 1: @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
sample 2: @RG\tID:group1\tSM:sample2\tPL:illumina\tLB:lib1\tPU:unit1

Or I should put the same name for the ID, SM, LB, PU and all other field if any. Like this,

sample 1: @RG\tID:sample1\tSM:sample1\tPL:illumina\tLB:sample1\tPU:sample1
sample 2: @RG\tID:sample2\tSM:sample2\tPL:illumina\tLB:sample2\tPU:sample2

I generally do a single-sample analysis each time with BWA, Picard and GATK, and then combine all those samples at the g.vcf stage. Basically I want to know whether this kind of read group identifier information causes any problem for GATK in downstream analysis, such as sample group identification or anything else.

P.S. As service providers most of the time do not provide the sample metadata along with the raw files, I had to simply put together this kind of RG information while doing data analysis following the GATK best practices.

This may be a silly Q but I really need an answer.

• Posts: 76Member

@aneek
Hi,

-Sheila

• Posts: 76Member
edited August 2015

Hi @Sheila,
Thanks a lot for the information. I have another query. Once I add the @RG information during alignment with BWA and mark duplicates with Picard tools, is it necessary to run Picard's AddOrReplaceReadGroups step again before going on to indel realignment with GATK?

I think this step should be optional and should only be used if no @RG information was added during BWA alignment, or to modify the @RG information in the BAM file. Please correct me if I am wrong.

Thanks..

@aneek
Hi,

You are correct: you do not need to use Picard's AddOrReplaceReadGroups if you already have read group information you are happy with in the BAM files.

-Sheila

• Posts: 76Member

@Sheila

Thank you very much for the suggestion.

• Posts: 76Member

Hello,

I was getting parsing error messages while converting my SAM file to BAM using Picard tools.

When I validated the SAM file, I found the following error messages:

ERROR: Record 145857980, Read name HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660, Read length does not match quals length
WARNING: Record 145857980, Read name HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660, NM tag (nucleotide differences) is missing

Please suggest how to correct or remove or ignore these errors in my SAM file.

Thank you.

These errors are very alarming, especially the first one. How are you processing your data?

Geraldine Van der Auwera, PhD

• Posts: 76Member

@Geraldine_VdAuwera

Thanks for the quick response. I am processing Illumina paired-end data. I used bwa mem to create the SAM file from both FASTQ files (R1 & R2). The command line I used is as follows,

bwa mem -M -R "@RG\tID:SHAREY\tLB:SHAREY\tSM:SHAREY\tPL:ILLUMINA" hg19 Sharey_R1.fastq Sharey_R2.fastq > Sharey.sam

Then I was using Picard tools for SAM-to-BAM conversion with the following command line,

java -jar picard.jar SortSam INPUT=Sharey.sam OUTPUT=Sharey_sorted.bam SORT_ORDER=coordinate

Now I am stuck in this step because this generates the above mentioned parsing error messages instead of generating a BAM file.

Please suggest how to solve this problem.

• Posts: 76Member

For your information, the parsing error message after SortSam is as follows,

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 4208503860 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII
at htsjdk.samtools.SAMLineParser.reportErrorParsingLine(SAMLineParser.java:439)
at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:324)
at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
at picard.sam.SortSam.doWork(SortSam.java:81)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

• Posts: 76Member

Any suggestion to my query?

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@aneek
Hi,

Try SamFormatConverter to convert the SAM file to a BAM file. http://broadinstitute.github.io/picard/command-line-overview.html#SamFormatConverter

-Sheila

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@aneek
Hi again,

Please ignore my previous comment. It turns out you need to sanitize your bam file. You can do this with RevertSam. http://broadinstitute.github.io/picard/command-line-overview.html#RevertSam There is an option to sanitize which will remove reads that are causing the error.

-Sheila

• Posts: 76Member

@Sheila
Thank you very much for the suggestion. I have used the RevertSam function but still the problem persists. I used the following command line in Picard tools:

java -jar picard.jar RevertSam INPUT=Sharey.sam OUTPUT=Sharey_sanitized.sam SANITIZE=true

The program finishes with the following error messages instead of generating a sanitized (error-free) SAM file:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 4208503860 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII

Please help.
#### Issue · Github

November 2015 by Sheila · Issue Number 318 · State: closed · Closed By: chandrans

• Posts: 76Member

Hello,

Any further suggestion please...

• Posts: 10,469Administrator, Dev admin

Try CleanSam.

Geraldine Van der Auwera, PhD

• Posts: 76Member

@Geraldine
Thanks for the suggestion. I have tried CleanSam too. It generates the cleaned SAM file by ignoring the parsing errors, i.e.:

Ignoring SAM validation error due to lenient parsing: Error parsing text SAM file. length(QUAL) != length(SEQ); File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 4208503860 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII
Ignoring SAM validation error due to lenient parsing: Error parsing text SAM file. Read length does not match quals length; File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 4208503860 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII
[Fri Nov 13 16:04:53 IST 2015] picard.sam.CleanSam done. Elapsed time: 27.94 minutes.
Runtime.totalMemory()=440401920

However, creating the sorted BAM file from this SAM file using SortSam gives the same error messages without generating any output. Should I keep the validation stringency lenient?

• Posts: 10,469Administrator, Dev admin

What happens when you set validation to strict?

Geraldine Van der Auwera, PhD

• Posts: 76Member

@Geraldine_VdAuwera
Hi,

Keeping the validation stringency strict also gives the same error messages as earlier.

• Posts: 76Member

Any further suggestion?
#### Issue · Github

November 2015 by Sheila · Issue Number 366 · State: closed · Closed By: chandrans

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@aneek
Hi,

If the error is caused by just one read, you can remove it manually. If there is more than one read causing the error, you can redo the mapping and make sure to do it with the latest version of the mapper. This sounds like a bug that may have been fixed in the latest version.

-Sheila

• USAPosts: 8Member

Identify read group info for multiplexed samples

I am familiar with the situation where I have multiple libraries of the same sample, or where we run the same library on multiple lanes. My question is about multiplexing. Let us say I have 2 samples (S1 and S2). I prepared both samples at the same time for paired-end sequencing 2x100 with 400bp insert size, then indexed and pooled them. However, I should give each sample a unique LB tag, right? The pooled sample was then sequenced on the same lane. I should give the sequence from each sample a different ID tag, but the PU tag should be the same for both samples, right?

These are the header lines of the 1st read in the R1 file of each sample:

@D3VG1JS1:214:C7RNWACXX:1:1101:1128:1956 1:N:0:ATTCCT
@D3VG1JS1:214:C7RNWACXX:1:1101:1088:1987 1:N:0:GATCAG

My proposed read group info for the 1st sample:

"@RG\tID:@D3VG1JS1.214.C7RNWACXX.1.sample1\tSM:S1\tPL:Illumina\tLB:Lib1\tPU:@D3VG1JS1.214.C7RNWACXX.1"

My proposed read group info for the 2nd sample:

"@RG\tID:@D3VG1JS1.214.C7RNWACXX.1.sample2\tSM:S2\tPL:Illumina\tLB:Lib2\tPU:@D3VG1JS1.214.C7RNWACXX.1"

I have reviewed the SAM file specification (https://samtools.github.io/hts-specs/SAMv1.pdf).
Also, I found these links in the GATK-related material, but none gave me a clear answer:
https://www.broadinstitute.org/gatk/guide/article?id=3059
https://www.broadinstitute.org/gatk/guide/article?id=6472 (unfortunately, very confusing)
https://www.broadinstitute.org/gatk/guide/article?id=1317

Thank you

• Broad Institute · Posts: 3,940 · Member, Broadie, Moderator, Dev admin

Yes to all of your questions. You are correct about giving the samples different libraries and IDs, but the same PU tag.

-Sheila

• USA · Posts: 8 · Member

Thank you

• Posts: 76 · Member · edited February 15

@Sheila said:
> @aneek Hi, If the error is caused by just one read, you can remove it manually. If there is more than one read causing the error, you can redo the mapping and make sure to do it with the latest version of the mapper. This sounds like a bug that may have been fixed in the latest version. -Sheila

Hi,

The problem still persists. You said it may be a bug that would be fixed in the latest versions. I am using bwa version 0.7.12 for mapping and Picard tools version 1.141 for sorting. I have also used stampy version 1.0.28 for mapping, which gives different parsing errors than bwa when sorting with Picard tools:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File /export/apps/HumanExome/Sharey.sam; Line 157676417
Line: HWI-700463F:78:H3VCJBCXX:2:2210:14020:72541 99 chr10 6195816967 87M11I2M = 61958168 87 CGGAAAGGTGAAGTGGGGTGTACCCAGAAGTTGTGGCTGCATTTGGAGATGCCCCTTGCTGCAACAGCTGTTGTACTATGTCTGCTTAGA
    at htsjdk.samtools.SAMLineParser.reportFatalErrorParsingLine(SAMLineParser.java:432)
    at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:217)
    at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
    at picard.sam.SortSam.doWork(SortSam.java:81)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
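To make the multiplexing answer earlier in this thread concrete, here is a minimal shell sketch of read-group strings for two samples pooled on the same lane: the ID, SM, and LB fields differ per sample, while PU (flowcell.lane) is shared. The flowcell and lane values below are illustrative placeholders, not taken from any real run.

```shell
#!/bin/sh
# Illustrative read-group strings for two samples multiplexed on one lane.
# Flowcell and lane values are placeholders.
FLOWCELL="C7RNWACXX"
LANE=1
PU="${FLOWCELL}.${LANE}"   # same PU for both samples: same flowcell lane

# ID, SM, and LB differ per sample; \t stays literal so bwa's -R flag
# can expand it into a real tab in the SAM header
RG1="@RG\tID:${PU}.S1\tSM:S1\tPL:illumina\tLB:Lib1\tPU:${PU}"
RG2="@RG\tID:${PU}.S2\tSM:S2\tPL:illumina\tLB:Lib2\tPU:${PU}"

printf '%s\n' "$RG1" "$RG2"
```

Each string would then be passed to the mapper through `-R`, e.g. `bwa mem -M -R "$RG1" reference.fa sample1.fq`.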

#### Issue · Github February 19 by Sheila · Issue Number 601 · State: closed · Closed By vdauwera

@aneek
Hi,

I'm not sure what is happening to your files. It seems like you are somehow getting formatting errors, and we cannot help easily with those. The only suggestion for now is to upgrade to the latest version of Picard. The one you are using is quite old.

-Sheila

• Taiwan · Posts: 2 · Member

Hi,
I have a problem using picard.jar MarkDuplicates in R.
I decided to follow the workflow in this paper: https://www.bioconductor.org/help/workflows/chipseqDB/
Following that example, I got an error message when I tried to use MarkDuplicates.
When I run the following command in R:

temp.bam <- "h3k9ac_temp.bam"
temp.file <- "h3k9ac_metric.txt"
temp.dir <- "h3k9ac_working"
dir.create(temp.dir)
for (bam in bam.files) {
    code <- system(sprintf("MarkDuplicates I=%s O=%s M=%s TMP_DIR=%s AS=true REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT",
                           bam, temp.bam, temp.file, temp.dir))
    stopifnot(code == 0L)
    file.rename(temp.bam, bam)
}
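One possible reason the `system()` call fails while `java -jar picard.jar -h` works is that `MarkDuplicates` may not exist as a standalone command on the PATH; Picard is typically invoked through `java -jar`. This is a guess, not a confirmed diagnosis. As a sketch, the full shell command the `system()` call would need to run could look like this (the jar path and file names are placeholders, and the command is printed for inspection rather than executed):

```shell
#!/bin/sh
# Sketch only: build and print the full Picard invocation.
# Jar path and file names are placeholders.
PICARD=/path/to/picard.jar
CMD="java -jar $PICARD MarkDuplicates I=h3k9ac_input.bam O=h3k9ac_temp.bam M=h3k9ac_metric.txt TMP_DIR=h3k9ac_working AS=true REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT"
printf '%s\n' "$CMD"
```

In the R loop, the same string could be built with `sprintf("java -jar %s MarkDuplicates I=%s ...", picard.jar.path, bam, ...)` and passed to `system()`.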

I got:

I have already checked the Java version, and both of the following commands also run fine:

java -jar /path/to/picard.jar -h
java -jar $PICARD -h

So I am wondering why I run into this trouble in R.
Did I miss something else?

@AbbieHsu
Hi,

Are you running on a Windows machine by any chance? Can you try running the command in a regular linux terminal? I don't think we can help you with what you are trying to do. Maybe you can try asking the authors of the paper for help.

-Sheila

• Taiwan · Posts: 2 · Member

Hi,
I'm running the command in a regular linux terminal. Thanks for your kind reply. I'll ask the authors for help.

• Sri Lanka · Posts: 1 · Member

Hi all,

I'm new to GATK and bioinformatics. I am currently doing 16S analysis with Ion Torrent data. I have a SAM file aligned to the SILVA database as reference using Bowtie2. Could you help me with the next step I should follow to remove duplicates from the SAM file?

@Ybanet
Hi,

Have a look at the Best Practices and this new document which replaces this one.

-Sheila
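For reference, the usual Picard sequence for this situation — coordinate-sort the SAM file, then mark duplicates — can be sketched as below. The jar path and file names are placeholders, and the commands are printed for inspection rather than executed:

```shell
#!/bin/sh
# Sketch: typical sort-then-mark-duplicates sequence with Picard.
# Jar path and file names are placeholders.
PICARD=/path/to/picard.jar
SORT_CMD="java -jar $PICARD SortSam INPUT=aligned.sam OUTPUT=sorted.bam SORT_ORDER=coordinate"
DEDUP_CMD="java -jar $PICARD MarkDuplicates INPUT=sorted.bam OUTPUT=dedup.bam METRICS_FILE=metrics.txt"
printf '%s\n' "$SORT_CMD" "$DEDUP_CMD"
```

MarkDuplicates flags duplicates by default rather than removing them; whether marking is appropriate for amplicon-style 16S data is a separate question worth checking against the Best Practices linked above.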

• University of Washington · Posts: 1 · Member

Hello,

Our sample consists of a pool of purified viral genomes from a DNA virus that we have put under selective pressure and passaged in bulk, essentially a quasispecies of mutated viral genomes. As such, duplicates could originate from different templates within the same sample. Would you still recommend marking duplicates in this scenario?