Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

(howto) Map and mark duplicates

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin
edited July 30 in Tutorials

Objective

Map the read data to the reference and mark duplicates.

Prerequisites

  • This tutorial assumes adapter sequences have been removed.

Steps

  1. Identify read group information
  2. Generate a SAM file containing aligned reads
  3. Convert to BAM, sort and mark duplicates

1. Identify read group information

The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.

Action

Compose the read group identifier in the following format:

@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 

where the \t stands for the tab character.


2. Generate a SAM file containing aligned reads

Action

Run the following BWA command:

In this command, replace read group info by the read group identifier composed in the previous step.

bwa mem -M -R ’<read group info>’ -p reference.fa raw_reads.fq > aligned_reads.sam 

replacing the <read group info> bit with the read group identifier you composed at the previous step.

The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).

Expected Result

This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.

Note that here we are using a command that is specific for pair ended data in an interleaved fastq file, which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-ended or pair-ended in forward/reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.


3. Convert to BAM, sort and mark duplicates

These initial pre-processing operations format the data to suit the requirements of the GATK tools.

Action

Run the following Picard command to sort the SAM file and convert it to BAM:

java -jar SortSam.jar \ 
    INPUT=aligned_reads.sam \ 
    OUTPUT=sorted_reads.bam \ 
    SORT_ORDER=coordinate 

Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

Action

Run the following Picard command to mark duplicates:

java -jar MarkDuplicates.jar \ 
    INPUT=sorted_reads.bam \ 
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt

Expected Result

This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.

Action

Run the following Picard command to index the BAM file:

java -jar BuildBamIndex.jar \ 
    INPUT=dedup_reads.bam 

Expected Result

This creates an index file for the BAM file called dedup_reads.bai.

Post edited by Sheila on

Geraldine Van der Auwera, PhD

Comments

  • michaelchaomichaelchao Posts: 13Member

    Hi,

    Following this week’s workshop, I have been trying the pipeline on my own data. Currently I am generating a SAM file containing aligned reads from pair-ended in forward/reverse read files:

    bwa mem -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' human_g1k_v37.fasta sample1_1.fq sample1_2.fq > aligned_reads_sample1.sam

    There is a lot of: skip orientation “” as there are not enough pairs (please see below). Does this indicate something is not running correctly?

    [M::main_mem] read 100000 sequences (10000000 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 41485, 1, 1) [M::mem_pestat] skip orientation FF as there are not enough pairs [M::mem_pestat] analyzing insert size distribution for orientation FR... [M::mem_pestat] (25, 50, 75) percentile: (122, 168, 239) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 473) [M::mem_pestat] mean and std.dev: (186.71, 84.97) [M::mem_pestat] low and high boundaries for proper pairs: (1, 590) [M::mem_pestat] skip orientation RF as there are not enough pairs [M::mem_pestat] skip orientation RR as there are not enough pairs [M::worker2@0] performed mate-SW for 58304 reads

    Thanks

  • pdexheimerpdexheimer Posts: 343Member, GSA Collaborator ✭✭✭

    Nope, looks right to me. The vast majority of your reads should be in FR orientation for standard paired end sequencing. It would be a problem if you did have a lot of reads in the other orientations

  • michaelchaomichaelchao Posts: 13Member

    Thanks. I am now running Picard to mark duplicates using the command above:

    java -jar MarkDuplicates.jar \ INPUT=sorted_reads.bam \ OUTPUT=dedup_reads.bam

    But received the following error:

    ERROR: Option 'METRICS_FILE' is required.

  • CarneiroCarneiro Posts: 275Administrator, GATK Developer admin

    The error message is nice and clear. You need to provide it a metrics file. I suggest: M=dedup_reads.bam.metrics

  • michaelchaomichaelchao Posts: 13Member

    Hi I have a question about the read group identifier:

    @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1

    I have 60 exome sequencing samples, and for each sample I need to modify the tSM accordingly (e.g. tSM:sample1 to sample100). Do I also need to change the tID, tLB and tPU for each sample (make them unique) or are they the same for all samples?

    For example:

    Sample1 @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 Sample2 @RG\tID:group2\tSM:sample2\tPL:illumina\tLB:lib2\tPU:unit2

    OR

    Sample1 @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 Sample2 @RG\tID:group1\tSM:sample2\tPL:illumina\tLB:lib1\tPU:unit1

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    You should use the information from the sequence metadata you get with the sequence. LB is the library identifier and PU refers to the sequencing machine. The ID is usually a unique identifier for each lane.

    Geraldine Van der Auwera, PhD

  • splaisansplaisan Posts: 14Member

    new question(s) here after watching the BroadE videos.

    • Is markduplicate with marking but not removing command enough for HaplotypeCaller downstream processing as suggested by the sample commands or should one add the parameter to actually remove the duplicates

    • when removing the duplicates, does this keep one prototype read in each duplicate group and if yes what quality score does it store (median, max ...)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin
    • markduplicate is sufficient; HC will automatically apply filters to ignore dupes.

    • I forget what is the rule followed to keep a prototype read; it might be random but I'm not certain. This might be documented in the Picard docs (since it's techincally not a step done with GATK, we don't have docs on this).

    Geraldine Van der Auwera, PhD

  • stechenstechen University of PennsylvaniaPosts: 23Member

    Greetings!

    I was wondering what kinds of errors could possibly cause the mark duplicate step to output a duplicates.metrics file where in the histogram is nearly-uniform? The input bam file was already sorted.

    Thank you! Stephanie

    docx
    docx
    duplicates.metrics.docx
    12K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Hi Stephanie,

    Sorry, that's a question for the authors of Picard tools, not us.

    By the way, I would recommend not communicating text information in MS Word documents. If you later need to use the information elsewhere, Word can mess up the formatting and cause further issues. Also, other people may not necessarily have a license for MS Word, and may not be able to open the file. Using pure text formats is much better practice for this kind of information.

    Geraldine Van der Auwera, PhD

  • binlangmanbinlangman 湖南长沙Posts: 5Member

    when I run the following command to sort the SAM file and convert it to BAM: java -jar /usr/picard-tools-1.105/SortSam.jar INPUT=sample.sam OUTPUt=sample.sort.bam SORT_ORDER=coordinate

    The terminal output the following things: [Fri Dec 20 17:15:05 CST 2013] net.sf.picard.sam.SortSam INPUT=sample.sam OUTPUT=sample.sort.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false Java HotSpot(TM) Server VM warning: You have loaded library /usr/picard-tools-1.105/libIntelDeflater.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'. [Fri Dec 20 17:15:05 CST 2013] Executing as binlangman@binlangman-Lenovo-G450 on Linux 3.2.0-57-generic i386; Java HotSpot(TM) Server VM 1.7.0_45-b18; Picard version: 1.105(1632) JdkDeflater [Fri Dec 20 17:17:15 CST 2013] net.sf.picard.sam.SortSam done. Elapsed time: 2.16 minutes. Runtime.totalMemory()=278396928 To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.ArrayList.iterator(ArrayList.java:814) at java.util.Collections$UnmodifiableCollection$1.(Collections.java:1064) at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1063) at net.sf.samtools.SAMRecord.validateCigar(SAMRecord.java:1436) at net.sf.samtools.SAMRecord.getCigar(SAMRecord.java:606) at net.sf.samtools.SAMRecord.getCigarLength(SAMRecord.java:617) at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1599) at net.sf.samtools.SAMLineParser.parseLine(SAMLineParser.java:328) at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:237) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:225) at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:201) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:687) at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:665) at net.sf.picard.sam.SortSam.doWork(SortSam.java:68) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:179) at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:120) at net.sf.picard.sam.SortSam.main(SortSam.java:57)

    And the BAM file called sample.sort.bam is null, and didn't contain anything, and it caused I could not finish the subsequent analysis. Why did it occured??? And in this case, what should I do???

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Hi @binlangman,

    This is an issue with Picard tools, not GATK. Please see the link included in the error message for getting help from the Picard team.

    Geraldine Van der Auwera, PhD

  • frankibfrankib Sherbrooke, CanadaPosts: 41Member

    This is my command line: bwa mem -M -R ’@RG\tID:cancer\tSM:cancer\tPL:illumina\tLB:cancer\tPU:unit1’ -p human_g1k_v37.fasta cancer_fwd.fastq cancer_rev.fastq -M > cancermem.sam

    and I got this error: [E::bwa_set_rg] the read group line is not started with @RG

    Any idea what is the problem here?

    Thank you for your help

  • frankibfrankib Sherbrooke, CanadaPosts: 41Member

    @frankib said: This is my command line: bwa mem -M -R ’@RG\tID:cancer\tSM:cancer\tPL:illumina\tLB:cancer\tPU:unit1’ -p human_g1k_v37.fasta cancer_fwd.fastq cancer_rev.fastq -M > cancermem.sam

    and I got this error: [E::bwa_set_rg] the read group line is not started with RG

    Any idea what is the problem here?

    Thank you for your help

    Ok I fixed my problem...

  • anand_mtanand_mt Posts: 3Member

    Hi,

    I am using the GATK pipeline for pre processing bam files after alignment with bwa mem. The original bam files after alignment shows I have (samtools flagstat command) - 173,460,757 reads (this is deep sequening exome data captured with agilent sure select 50 mb).

    But after removing duplicates with Picard, I am left with 14,651,238 reads !! Thats like mere 20X coverage.

    1. I would like to know whether this is normal in exome seq to find such huge amount duplicates? And some of the threads on other forums say its not wise to remove duplicates from deep sequencing data.

    2. And what is the difference between marking duplicates and removing duplicates ? I know marking adds a tag instead of completely removing the read. But, if the duplicate marked reads are not used in any of the downstream steps (like SNP calling) why are simply marking it instead of removing it?

    3. And while calculating coverage do we have to consider duplicate reads as well (original bam) or the final bam file with dups removed ?

    Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Hi @anand‌_mt,

    1. No that's not normal. Did you run MarkDuplicates on all the data together or per lane of data?

    2. Basically it's because we don't like to discard data entirely. We prefer to leave them there just in case.

    3. Effective coverage is the coverage after removing duplicates because that's what will be used for analysis, but it's a good idea to look at both before & after, and do a comparison.

    Geraldine Van der Auwera, PhD

  • jur2014jur2014 New York NYPosts: 1Member

    I usually use samtools rmdup, but testing out MarkDuplicates.jar for the first time. I am running this on 30x genome data, and everything appears to be going OK, except it is stuck at the "Traversing read pair information and detecting duplicates" for over 48 hours. Is that expected? I am using a machine with 64 1.4 Ghz cores and 512GB RAM (400 GB allocated to MarkDuplicates using -Xmx). Has it crashed or should I just wait patiently for it to finish?

    My impression is that samtools rmdup is much faster, are there any downsides to using it instead?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Well, MarkDuplicates can be very slow. Maybe check the table of processes on your machine to see if it's still moving at all. If that's doesn't give you a clear answer, you may have better luck asking the Picard tools support folks (MarkDupes is not our tool so there's not that much we can say about it).

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 152Member ✭✭✭

    More of a samtools forum question, but it used to be samtools rmdup did not do chimeric molecule duplicate marking while MarkDuplicates.jar did...that is the only thing that I could remember. I have no idea if that is still the case. MarkingDuplicates on non-PCR free WGS libraries always seemed to stink IIRC (take a long time and use lots of RAM).

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    FYI we have a new implementation of MarkDupes in development that runs waaaaay faster. It will be a while before we can make it available, but just so you know there's hope for the future ;)

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.