(howto) Map and mark duplicates

See Tutorial#6747 for a comparison of MarkDuplicates and MarkDuplicatesWithMateCigar, downloadable example data to follow along, and additional commentary.
Objective
Map the read data to the reference and mark duplicates.
Prerequisites
- This tutorial assumes adapter sequences have been removed.
Steps
- Identify read group information
- Generate a SAM file containing aligned reads
- Convert to BAM, sort and mark duplicates
1. Identify read group information
The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.
Action
Compose the read group identifier in the following format:
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
where \t stands for the tab character.
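In practice it helps to compose this once in a single-quoted shell variable so the \t sequences stay literal, using plain straight quotes (curly quotes pasted from a word processor will make bwa reject the read group line). A minimal sketch using the tutorial's placeholder values:

```shell
# Single quotes keep the \t sequences literal, which is what bwa -R expects;
# the values below are the tutorial's placeholders, not real metadata.
RG='@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1'

# %s does not expand backslash escapes, so this shows exactly what bwa will see.
printf '%s\n' "$RG"
```

The variable can then be passed on the bwa command line as -R "$RG".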
2. Generate a SAM file containing aligned reads
Action
Run the following BWA command:
bwa mem -M -R '<read group info>' -p reference.fa raw_reads.fq > aligned_reads.sam
replacing <read group info> with the read group identifier you composed in the previous step. Note that the quotes must be plain straight quotes; typographic (curly) quotes will cause BWA to reject the read group line.
The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).
Expected Result
This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.
Note that this command is specific to paired-end data in an interleaved FASTQ file (read pairs together in the same file, with each forward read followed directly by its paired reverse read), which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-end, or paired-end with separate forward and reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.
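For reference, a sketch of the non-interleaved variant (forward and reverse reads in separate files): the -p flag is dropped and both files are passed. The file names here are hypothetical; substitute your own.

```shell
# Hypothetical file names; -p is omitted because the pairs live in two files
# rather than interleaved in one.
bwa mem -M -R '<read group info>' reference.fa raw_reads_1.fq raw_reads_2.fq > aligned_reads.sam
```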
3. Convert to BAM, sort and mark duplicates
These initial pre-processing operations format the data to suit the requirements of the GATK tools.
Action
Run the following Picard command to sort the SAM file and convert it to BAM:
java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate
Expected Result
This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.
Action
Run the following Picard command to mark duplicates:
java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt
Expected Result
This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.
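If you want a quick look at the duplication rate, PERCENT_DUPLICATION in that metrics file is the usual number of interest. A sketch that pulls it out by column name rather than by position; the metrics.txt written below is an invented, heavily truncated stand-in for real Picard output (real files carry more columns, comment lines and a histogram section):

```shell
# Invented, truncated stand-in for a MarkDuplicates metrics file.
printf 'LIBRARY\tREAD_PAIRS_EXAMINED\tPERCENT_DUPLICATION\n' > metrics.txt
printf 'lib1\t41485\t0.05\n' >> metrics.txt

# Find the PERCENT_DUPLICATION column by header name, then print it per library.
awk -F'\t' '
  $1 == "LIBRARY" { for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i; next }
  col && NF >= col { print $1, $col }
' metrics.txt
```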
Action
Run the following Picard command to index the BAM file:
java -jar picard.jar BuildBamIndex \
    INPUT=dedup_reads.bam
Expected Result
This creates an index file for the BAM file, called dedup_reads.bai.
Comments
Hi,
Following this week’s workshop, I have been trying the pipeline on my own data. Currently I am generating a SAM file containing aligned reads from pair-ended in forward/reverse read files:
bwa mem -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' human_g1k_v37.fasta sample1_1.fq sample1_2.fq > aligned_reads_sample1.sam
There is a lot of "skip orientation ... as there are not enough pairs" output (please see below). Does this indicate something is not running correctly?
[M::main_mem] read 100000 sequences (10000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 41485, 1, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (122, 168, 239)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 473)
[M::mem_pestat] mean and std.dev: (186.71, 84.97)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 590)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_matesw] performed mate-SW for 58304 reads
Thanks
Nope, looks right to me. The vast majority of your reads should be in FR orientation for standard paired-end sequencing. It would be a problem if you did have a lot of reads in the other orientations.
Thanks. I am now running Picard to mark duplicates using the command above:
java -jar MarkDuplicates.jar \
INPUT=sorted_reads.bam \
OUTPUT=dedup_reads.bam
But received the following error:
ERROR: Option 'METRICS_FILE' is required.
The error message is nice and clear. You need to provide it a metrics file. I suggest: M=dedup_reads.bam.metrics
Hi
I have a question about the read group identifier:
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
I have 60 exome sequencing samples, and for each sample I need to modify the SM field accordingly (e.g. SM:sample1 to SM:sample100). Do I also need to change the ID, LB and PU for each sample (make them unique) or are they the same for all samples?
For example:
Sample1
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
Sample2
@RG\tID:group2\tSM:sample2\tPL:illumina\tLB:lib2\tPU:unit2
OR
Sample1
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
Sample2
@RG\tID:group1\tSM:sample2\tPL:illumina\tLB:lib1\tPU:unit1
Thank you
You should use the information from the sequencing metadata you get with the sequence. LB is the library identifier, PU is the platform unit (typically flowcell, lane and sample barcode), and ID is usually a unique identifier for each lane.
New question(s) here after watching the BroadE videos.
Is MarkDuplicates with marking but not removing enough for HaplotypeCaller downstream processing, as suggested by the sample commands, or should one add the parameter to actually remove the duplicates?
When removing the duplicates, does this keep one prototype read in each duplicate group, and if yes, what quality score does it store (median, max, ...)?
MarkDuplicates is sufficient; HC will automatically apply filters to ignore dupes.
I forget what rule is followed to keep a prototype read; it might be random but I'm not certain. This might be documented in the Picard docs (since it's technically not a step done with GATK, we don't have docs on this).
Greetings!
I was wondering what kinds of errors could possibly cause the mark duplicates step to output a duplicates.metrics file in which the histogram is nearly uniform? The input bam file was already sorted.
Thank you!
Stephanie
Hi Stephanie,
Sorry, that's a question for the authors of Picard tools, not us.
By the way, I would recommend not communicating text information in MS Word documents. If you later need to use the information elsewhere, Word can mess up the formatting and cause further issues. Also, other people may not necessarily have a license for MS Word, and may not be able to open the file. Using pure text formats is much better practice for this kind of information.
when I run the following command to sort the SAM file and convert it to BAM:
java -jar /usr/picard-tools-1.105/SortSam.jar INPUT=sample.sam OUTPUt=sample.sort.bam SORT_ORDER=coordinate
The terminal output the following things:
[Fri Dec 20 17:15:05 CST 2013] net.sf.picard.sam.SortSam INPUT=sample.sam OUTPUT=sample.sort.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
Java HotSpot(TM) Server VM warning: You have loaded library /usr/picard-tools-1.105/libIntelDeflater.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
[Fri Dec 20 17:15:05 CST 2013] Executing as [email protected] on Linux 3.2.0-57-generic i386; Java HotSpot(TM) Server VM 1.7.0_45-b18; Picard version: 1.105(1632) JdkDeflater
[Fri Dec 20 17:17:15 CST 2013] net.sf.picard.sam.SortSam done. Elapsed time: 2.16 minutes.
Runtime.totalMemory()=278396928
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.ArrayList.iterator(ArrayList.java:814)
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1064)
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1063)
at net.sf.samtools.SAMRecord.validateCigar(SAMRecord.java:1436)
at net.sf.samtools.SAMRecord.getCigar(SAMRecord.java:606)
at net.sf.samtools.SAMRecord.getCigarLength(SAMRecord.java:617)
at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1599)
at net.sf.samtools.SAMLineParser.parseLine(SAMLineParser.java:328)
at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:237)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:225)
at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:201)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:687)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:665)
at net.sf.picard.sam.SortSam.doWork(SortSam.java:68)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:179)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:120)
at net.sf.picard.sam.SortSam.main(SortSam.java:57)
And the BAM file called sample.sort.bam is empty and didn't contain anything, which meant I could not finish the subsequent analysis. Why did it occur? And in this case, what should I do?
Thanks!
Hi @binlangman,
This is an issue with Picard tools, not GATK. Please see the link included in the error message for getting help from the Picard team.
Thanks!
This is my command line:
bwa mem -M -R ’@RG\tID:cancer\tSM:cancer\tPL:illumina\tLB:cancer\tPU:unit1’ -p human_g1k_v37.fasta cancer_fwd.fastq cancer_rev.fastq -M > cancermem.sam
and I got this error:
[E::bwa_set_rg] the read group line is not started with @RG
Any idea what is the problem here?
Thank you for your help
Ok I fixed my problem...
Hi,
I am using the GATK pipeline for pre-processing bam files after alignment with bwa mem. The original bam file after alignment shows (samtools flagstat command) 173,460,757 reads (this is deep sequencing exome data captured with Agilent SureSelect 50 Mb).
But after removing duplicates with Picard, I am left with 14,651,238 reads!! That's like a mere 20X coverage.
I would like to know whether it is normal in exome seq to find such a huge amount of duplicates? And some of the threads on other forums say it's not wise to remove duplicates from deep sequencing data.
And what is the difference between marking duplicates and removing duplicates? I know marking adds a tag instead of completely removing the read. But if the duplicate-marked reads are not used in any of the downstream steps (like SNP calling), why are we simply marking them instead of removing them?
And while calculating coverage do we have to consider duplicate reads as well (original bam) or the final bam file with dups removed?
Thank you.
Hi @anand_mt,
No that's not normal. Did you run MarkDuplicates on all the data together or per lane of data?
Basically it's because we don't like to discard data entirely. We prefer to leave them there just in case.
Effective coverage is the coverage after removing duplicates because that's what will be used for analysis, but it's a good idea to look at both before & after, and do a comparison.
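On the coverage comparison, marked duplicates are simply reads whose FLAG field carries the 0x400 (1024) bit, so they can be counted directly. A sketch on a toy uncompressed SAM with invented records (for a real BAM, stream it through samtools view first):

```shell
# Two invented SAM records; field 2 is the FLAG, and 1123 = 1024 + 99, i.e.
# the same flags as read1 plus the 0x400 duplicate bit.
printf 'read1\t99\tchr1\t100\t60\t100M\t=\t300\t300\tACGT\tIIII\n' > toy.sam
printf 'read2\t1123\tchr1\t100\t60\t100M\t=\t300\t300\tACGT\tIIII\n' >> toy.sam

# POSIX awk has no bitwise AND, so test the 1024 bit arithmetically.
awk -F'\t' '!/^@/ && int($2 / 1024) % 2 == 1 { n++ } END { print n + 0 }' toy.sam
```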
I usually use samtools rmdup, but testing out MarkDuplicates.jar for the first time.
I am running this on 30x genome data, and everything appears to be going OK, except it has been stuck at "Traversing read pair information and detecting duplicates" for over 48 hours. Is that expected? I am using a machine with 64 1.4 GHz cores and 512 GB RAM (400 GB allocated to MarkDuplicates using -Xmx). Has it crashed or should I just wait patiently for it to finish?
My impression is that samtools rmdup is much faster, are there any downsides to using it instead?
Well, MarkDuplicates can be very slow. Maybe check the table of processes on your machine to see if it's still moving at all. If that doesn't give you a clear answer, you may have better luck asking the Picard tools support folks (MarkDupes is not our tool so there's not that much we can say about it).
More of a samtools forum question, but it used to be that samtools rmdup did not do chimeric molecule duplicate marking while MarkDuplicates.jar did... that is the only thing I could remember. I have no idea if that is still the case. Marking duplicates on non-PCR-free WGS libraries always seemed to stink IIRC (take a long time and use lots of RAM).
FYI we have a new implementation of MarkDupes in development that runs waaaaay faster. It will be a while before we can make it available, but just so you know there's hope for the future.
Yay!
Good Morning,
I have a question regarding the alignment of reads. These reads previously underwent cleanup with Trimmomatic, which resulted in two paired-end files and two single-end files. How should I proceed in this case? Make an alignment for each file separately and then merge with SAMtools?
I thank you and apologize for the difficulty with English.
Have a nice day,
Rhewter.
@Rhewter
Hi Rhewter,
I think your idea of making an alignment for each single-end and merging with SAMtools will work.
Good luck!
-Sheila
Thank you for this helpful tutorial! Is there any way to fill in the RG fields directly by reading from the FASTQ header containing information about flow cell, sample and library?
Thank you!
@Katie
Hi,
Unfortunately, there is no way to do that unless you write your own script.
-Sheila
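For anyone who does end up scripting this: a sketch that derives flowcell, lane and barcode from a Casava 1.8+ style FASTQ header and builds ID and PU from them. The header below is just an example, and the SM/LB values are placeholders that still have to come from your own sample records.

```shell
# Example Casava 1.8+ header: @instrument:run:flowcell:lane:tile:x:y pair:filter:0:barcode
header='@D3VG1JS1:214:C7RNWACXX:1:1101:1128:1956 1:N:0:ATTCCT'

flowcell=$(printf '%s' "$header" | cut -d: -f3)
lane=$(printf '%s' "$header" | cut -d: -f4)
barcode=$(printf '%s' "$header" | awk '{print $2}' | cut -d: -f4)

# SM and LB below are placeholders; ID and PU are derived from the header.
printf '%s\n' "@RG\tID:${flowcell}.${lane}\tPU:${flowcell}.${lane}.${barcode}\tSM:sample1\tPL:illumina\tLB:lib1"
```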
Hi,
I am using the MarkDuplicates tool to remove PCR duplicates from ATAC-Seq data. I was wondering if you found the answer to @splaisan 's question from Oct 2013 - "when removing the duplicates, does this keep one prototype read in each duplicate group and if yes what quality score does it store (median, max ...)". I just want to be sure it keeps one prototype regardless of the criteria for selection.
The rmdup tool from samtools keeps the read with the best mapping quality and I was wondering how MarkDuplicates deals with this.
Thanks in advance!
Ramya
Hi @ramya88 I'm not sure; you should ask this on the Picard tools mailing list since it is not our tool.
Will do - thanks!
Hi all,
In "Identify read group information", this command is not working for me: @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
The terminal shows "command not found". Can somebody explain the best way to complete this tutorial? I have downloaded the sample with reference from bio planet.
@sadiq
Hi,
The line you posted is not a command. It is the line in the SAM file that contains the read group and sample information.
This FAQ article will help you check if the line is properly formatted in your file: http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files
Take a look under 7. How can I tell if my BAM file has read group and sample information?
If the information is not properly there, you can use Picard's AddOrReplaceReadGroups: http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups
-Sheila
I have generated aligned_reads_sample1.sam but I don't know what to do with it. I have a raw file and a reference file and I want to use them via GATK. Can somebody guide me with the proper steps? I am very new here.
Whenever I follow this command: java -jar SortSam.jar \ INPUT=aligned_read.sam \ OUTPUT = sorted_read.bam \ SORT_ORDER=coordinate
it gives me: Error: Unable to access jarfile SortSam.jar.
What am I doing wrong?
@sadiq
Hi,
You have not specified the proper path to the SortSam.jar. You need to specify where exactly the SortSam.jar is on your computer. This tutorial may help (under SAMtools): http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices#latest
-Sheila
Hi,
I saw that using a reference from the bundle is recommended as a best practice. In the hg19 bundle, the reference includes haplotype chromosomes (chr6_apd_hap1) and unplaced chromosomes (chrUn_gl000211). Is it a reasonable workflow to exclude these from the reference? Thus, I would align to chrM,1-22,X and Y and then continue with the Best Practice workflow for DNAseq.
Thank you,
Faith
We recommend including the non-canonical contigs when you are doing the initial mapping to the reference. Afterward you can exclude them when you are running the Best Practices workflow.
Hello All,
I am working on this command here is my command
bwa mem -M -R ’@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 ’ -p ucsc.hg19.fasta gcat_set_037_2.fastq > aligned_reads.sam
It generates a corrupt SAM file, and the terminal gives me this error:
[E::bwa_set_rg] the read group line is not started with @RG
Please direct me.
Thanks
bwa mem -M -R "@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1" -p ucsc.hg19.fasta gcat_set_037_2.fastq > aligned_reads.sam
After adding double quotes I am getting:
[bwt_restore_sa] fail to open file 'ucsc.hg19.fasta.sa' : No such file or directory
Can somebody give me a clue regarding my problem above, please?
@sadiq For a problem like that you need to ask for support from the authors of BWA, not us, since it is not our tool. Or ask on SeqAnswers
I've met some problems while running Picard.
After I generated the SAM file and ran it through Picard, it said:
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File aligned_reads5.sam; Line 1
Line: [M::bwa_idx_load_from_disk] read 0 ALT contigs
at htsjdk.samtools.SAMLineParser.reportFatalErrorParsingLine(SAMLineParser.java:427)
How do I solve this problem? Please help me.
The head of the SAM file goes like this:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 111112 sequences (10000080 bp)...
[M::process] read 111112 sequences (10000080 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (4, 42523, 0, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (158, 199, 255)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 449)
[M::mem_pestat] mean and std.dev: (211.41, 71.12)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 546)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 111112 reads in 33.378 CPU sec, 33.243 real sec
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
The commands I entered were:
nohup bwa mem -M ucsc.hg19.fasta pe_1.fq pe_2.fq > aligned_reads6.sam
java -Xmx15g -jar /home/zyw/picard-tools-1.131/picard.jar SortSam \
    INPUT=aligned_reads5.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate
@WenhsunWu Your SAM file is contaminated by log messages. There must be an error in the script you are using (probably an improper use of pipes). You must correct your script or run the mapping step manually to produce a valid SAM file.
@Geraldine_VdAuwera
Thanks for answering me!
I've followed the steps strictly one by one and tried many times but it just didn't work.
So is that a problem with BWA itself? Can I simply delete the beginning rows?
@WenhsunWu Can you describe how you run the tools, and what command line you used?
@Geraldine_VdAuwera
Yes.
bwa index -a ucsc.hg19.fasta
I ran all this on Debian 3.2.65-1
The bwa version I used is 0.7.12-r1039
samtools version is 0.1.18 (r982:295)
picard version is 1.131
@WenhsunWu
Try running the bwa mem command without using nohup. As @Geraldine_VdAuwera says, your SAM file is being contaminated with bwa's logging messages (stderr and stdout are both being redirected to the file). Your command says to redirect stdout to the file (which is what you are supposed to do), but I am wondering if the reason that stderr is going to the file as well is because you are nohupping the command.
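The stdout/stderr separation can be illustrated with a stand-in for the mapper; the function below just mimics bwa's behaviour of writing alignment records to stdout and log lines to stderr.

```shell
# Stand-in for `bwa mem ...`: one alignment record on stdout, one log line on
# stderr, mimicking how bwa separates the two streams.
run_mapper() {
  printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf '[M::mem] log line\n' >&2
}

# Redirect stdout to the SAM file and stderr to a separate log, so the SAM
# stays clean even if the whole invocation is wrapped in nohup.
run_mapper > aligned.sam 2> bwa.log

cat aligned.sam   # only the alignment record, no log contamination
```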
Yeah that would be my guess too.
@Kurt
@Geraldine_VdAuwera
Yeah you two are right! I've figured out how to do that. Thank you!
Hi,
I have a question about the read group identifier. Is it essential to keep unique read group identifiers for each sample when I do analysis with GATK?
Usually what I use like this,
sample 1: @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
sample 2: @RG\tID:group1\tSM:sample2\tPL:illumina\tLB:lib1\tPU:unit1
Or I should put the same name for the ID, SM, LB, PU and all other field if any. Like this,
sample 1: @RG\tID:sample1\tSM:sample1\tPL:illumina\tLB:sample1\tPU:sample1
sample 2: @RG\tID:sample2\tSM:sample2\tPL:illumina\tLB:sample2\tPU:sample2
I generally do a single-sample analysis each time with BWA, Picard and GATK and then add all those samples in g.vcf. Basically I want to know if this kind of read group identifier information causes any problem for GATK in downstream analysis, like sample group identification or anything else.
P.S. As most of the time service providers do not provide the sample metadata information along with the raw files, I had to simply put this kind of RG information in while doing data analysis following the GATK Best Practices.
This may be a silly Q but I really need an answer.
Thanks in advance...
Any reply to my queries?
@aneek
Hi,
This article should clarify any questions you have: https://www.broadinstitute.org/gatk/guide/article?id=1317
-Sheila
Hi @Sheila,
Thanks a lot for the information. I have another query. Once I add @RG information during alignment through BWA and mark duplicates with Picard tools, is it necessary to run Picard's AddOrReplaceReadGroups step again before going on to indel realignment through GATK?
I think this step should be optional and should only be used if no @RG information was added during BWA alignment, or to modify the @RG information in the BAM file. Please correct me if I am wrong.
Thanks.
@aneek
Hi,
You are correct, you do not need to use Picard's AddOrReplaceReadGroups if you already have read group information you are happy with in the bam files.
-Sheila
@Sheila
Thank you very much for the suggestion.
Hello,
When I validated the SAM file I found the following error messages:
ERROR: Record 145857980, Read name HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660, Read length does not match quals length
WARNING: Read name HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660, A record is missing a read group
WARNING: Record 145857980, Read name HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660, NM tag (nucleotide differences) is missing
Please suggest how to correct or remove or ignore these errors in my SAM file.
Thank you.
These errors are very alarming, especially the first one. How are you processing your data?
@Geraldine_VdAuwera
Thanks for the quick response. I was processing Illumina paired-end data. I used bwa mem to create the SAM file from the two fastq files (R1 & R2). The command line I used for this is as follows,
bwa mem -M -R "@RG\tID:SHAREY\tLB:SHAREY\tSM:SHAREY\tPL:ILLUMINA" hg19 Sharey_R1.fastq Sharey_R2.fastq > Sharey.sam
Then I was using Picard tools for SAM to BAM conversion using the following command line,
java -jar picard.jar SortSam INPUT=Sharey.sam OUTPUT=Sharey_sorted.bam SORT_ORDER=coordinate
Now I am stuck at this step because it generates the above-mentioned parsing error messages instead of generating a BAM file.
Please suggest how to solve this problem.
Hi @Geraldine_VdAuwera ,
For your information, the parsing error message after SortSam is as follows,
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 42085038 60 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII
at htsjdk.samtools.SAMLineParser.reportErrorParsingLine(SAMLineParser.java:439)
at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:324)
at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
at picard.sam.SortSam.doWork(SortSam.java:81)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Any suggestion to my query..
@aneek
Hi,
Try SamFormatConverter to convert the sam file to a bam file. http://broadinstitute.github.io/picard/command-line-overview.html#SamFormatConverter
-Sheila
@aneek
Hi again,
Please ignore my previous comment. It turns out you need to sanitize your bam file. You can do this with RevertSam. http://broadinstitute.github.io/picard/command-line-overview.html#RevertSam There is an option to sanitize which will remove reads that are causing the error.
-Sheila
@Sheila
Thank you very much for the suggestion. I have used the RevertSam function but the problem still persists. I used the following Picard tools command line,
java -jar picard.jar RevertSam INPUT=Sharey.sam OUTPUT=Sharey_sanitized.sam SANITIZE=true
The program finishes with the following error messages instead of generating a sanitized (error-free) SAM file.
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 42085038 60 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII
Please help.
Hello,
Any further suggestion please...
Try CleanSam
@Geraldine
Thanks for the suggestion. I have tried CleanSam too. It generates the cleaned SAM file by ignoring the parsing errors, i.e.
Ignoring SAM validation error due to lenient parsing:
Error parsing text SAM file. length(QUAL) != length(SEQ); File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 42085038 60 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII
Ignoring SAM validation error due to lenient parsing:
Error parsing text SAM file. Read length does not match quals length; File /share/apps/HumanExome/Sharey.sam; Line 145858075
Line: HWI-700463F:78:H3VCJBCXX:2:2109:9414:82660 147 chr7 42085038 60 100M = 42084962 -176 TGAGGGAATAATGTCTGCATAGGGGCTGCGCTGGCCAGTTAGCAGGGCCATCTGATGATAGTATTCTGCTGGGCTGACTCCTGCATGGGGCGCTAGGAGG GIIIIGIIGGGIIGIIGGGIIIIIGIIGIIIIIIGIIIIIGIIIGIIIIIGIIIGIIGIIIIGIGGIIGGGIIIIGGGAIII
[Fri Nov 13 16:04:53 IST 2015] picard.sam.CleanSam done. Elapsed time: 27.94 minutes.
Runtime.totalMemory()=440401920
However, while creating the sorted BAM file from this SAM file using SortSam, I get the same error messages without any output being generated. Should I keep the validation stringency lenient?
What happens when you set validation to strict?
@Geraldine_VdAuwera
Hi,
Keeping the validation stringency strict also gives the same error messages as earlier.
Any further suggestion?
@aneek
Hi,
If the error is caused by just one read, you can remove it manually. If there is more than one read causing the error, you can redo the mapping and make sure to do it with the latest version of the mapper. This sounds like a bug that may have been fixed in the latest version.
-Sheila
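When only a few records are malformed like this, they can be listed with a quick pass over the uncompressed SAM, since SEQ and QUAL are tab-delimited fields 10 and 11 and must be the same length. A sketch on a toy file with invented records:

```shell
# Invented records: `bad` has a QUAL string shorter than its SEQ.
printf 'good\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n' > check.sam
printf 'bad\t0\tchr1\t5\t60\t4M\t*\t0\t0\tACGT\tII\n' >> check.sam

# Print the names of records whose SEQ and QUAL lengths disagree, so they can
# be located (and, if rare, removed) before sorting.
awk -F'\t' '!/^@/ && length($10) != length($11) { print $1 }' check.sam
```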
Identify read group info for multiplexed samples
I am familiar with the situation where I have multiple libraries of the same sample or when we run the same library on multiple lanes. My question is about multiplexing.
Let us say I have 2 samples (S1 and S2). I prepared both samples at the same time for paired-end sequencing (2x100 with 400 bp insert size), then indexed and pooled them. I should give each sample a unique LB tag, right?
Then the pooled sample was sequenced on the same lane. I should give the sequence from each sample a different ID tag but the PU tag should be the same for both samples, right?
These are the header lines of the 1st read in the R1 file of each sample:
@D3VG1JS1:214:C7RNWACXX:1:1101:1128:1956 1:N:0:ATTCCT
@D3VG1JS1:214:C7RNWACXX:1:1101:1088:1987 1:N:0:GATCAG
My proposed read info for the 1st sample should be:
"@RG\tID:@D3VG1JS1.214.C7RNWACXX.1.sample1\tSM:S1\tPL:Illumina\tLB:Lib1\tPU:@D3VG1JS1.214.C7RNWACXX.1"
My proposed read info for the 2nd sample should be:
"@RG\tID:@D3VG1JS1.214.C7RNWACXX.1.sample2\tSM:S2\tPL:Illumina\tLB:Lib2\tPU:@D3VG1JS1.214.C7RNWACXX.1"
I have reviewed the SAM file specification (https://samtools.github.io/hts-specs/SAMv1.pdf). I also found these links in the GATK-related material but none gave me a clear answer:
https://www.broadinstitute.org/gatk/guide/article?id=3059
https://www.broadinstitute.org/gatk/guide/article?id=6472 (Unfortunately, very confusing)
https://www.broadinstitute.org/gatk/guide/article?id=1317
Thank you
@drtamermansour
Hi,
Yes to all of your questions. You are correct about giving the samples different libraries and IDs, but the same PU tag.
-Sheila
Thank you
Hi,
The problem still persists. You said it may be a bug that would be fixed in the latest versions. I am using bwa version 0.7.12 for mapping and Picard tools version 1.141 for sorting. I have also used stampy version 1.0.28 for mapping, which gives different parsing errors than bwa while sorting through Picard tools:

at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:217)
at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
at picard.sam.SortSam.doWork(SortSam.java:81)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
Please suggest.
@aneek
Hi,
I'm not sure what is happening to your files. It seems like you are somehow getting formatting errors, and we cannot easily help with those. The only suggestion for now is to upgrade to the latest version of Picard; the one you are using is quite old.
-Sheila
Hi,
I have a problem using picard.jar MarkDuplicates in R.
I decided to follow the workflow in this paper: https://www.bioconductor.org/help/workflows/chipseqDB/
As in that example, I got an error message when I tried to use MarkDuplicates.
When I import the following command in R:
temp.bam <- "h3k9ac_temp.bam"
temp.file <- "h3k9ac_metric.txt"
temp.dir <- "h3k9ac_working"
dir.create(temp.dir)
for (bam in bam.files) {
code <- system(sprintf("MarkDuplicates I=%s O=%s M=%s \
TMP_DIR=%s AS=true REMOVE_DUPLICATES=false \
VALIDATION_STRINGENCY=SILENT", bam, temp.bam,
temp.file, temp.dir))
stopifnot(code==0L)
file.rename(temp.bam, bam) }
I got:
sh: MarkDuplicates: command not found
I have already checked the JAVA version.
And both of the following command can also be done.
java -jar /path/to/picard.jar -h
java -jar $PICARD -h
So I am wondering why I got this trouble in R.
Did I miss something else?
I'm not good with computers, please help me!
@AbbieHsu
Hi,
Are you running on a Windows machine by any chance? Can you try running the command in a regular linux terminal? I don't think we can help you with what you are trying to do. Maybe you can try asking the authors of the paper for help.
-Sheila
Hi,
I'm running the command in a regular linux terminal. Thanks for your kind reply. I'll ask the authors for help.
Hi all,
Actually I'm new to GATK and bioinformatics. I am currently doing 16S analysis with Ion Torrent data. I now have a SAM file aligned to the SILVA db as reference using Bowtie2. Could you help me with what the next step would be to remove duplicates from the SAM file?
@Ybanet
Hi,
Have a look at the Best Practices and this new document which replaces this one.
-Sheila
Hello,
Our sample consists of a pool of purified viral genomes from a DNA virus that we have put under selective pressure and passaged in bulk, essentially a quasispecies of mutated viral genomes. As such, duplicates could originate from different templates within the same sample. Would you still recommend marking duplicates in this scenario?
@KarinaDiaz
Hi,
I think this thread will help you.
-Sheila