The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# What's doing the MarkDuplicates?

ArgentinaMember Posts: 10
edited March 2014

Hello,

i have a ION PGM sequencing, i follow the best practices to do the variant calling

my command line:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:IONTORRENT\tLB:lib1\tPU:unit1' /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa 1.fq.gz > 1_aligned_reads.sam

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_target_intervals.list

after the IndelRealigner step, i check the sorted_reads.bam and the bam i get with the IndelRealigner on IGV...

in the position that show the image, after the realignment only 5 reads are keep, the question is why all the reads that have the variant in the reverse strand are gone?

i don't understend, these reads are placed somewhere else in the alignment?

Tagged:

• ArgentinaMember Posts: 10
edited March 2014

i check, and actually in the dedup_reads.bam file looks like the realigned_reads.bam, so this reads are filtered in the MarkDuplicates step.. but i can't see why, if they have a variant actually

• Cambridge, UKMember Posts: 111 ✭✭✭

The reads that seem to have been filtered do rather look like PCR duplicates. Was this a very low diversity custom pulldown library by any chance?

Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University

• ArgentinaMember Posts: 10

i understand that PCR products must be cleaned, but why the MarkDuplicates algorithm keeps as a good variant the red variant 'T' observed in a single read in the alignment, and not the 'A' observed more than once

Cristian

Hi Cristian,

MarkDuplicates does not look at variants or how often they are represented. It only looks at the starting position and the CIGAR string of the reads. If the start position and CIGAR are identical, then the reads are duplicates. So the program only leaves one read untouched and marks the others so they can be ignored in downstream analyses. If the multiple observations of 'A' are on duplicate reads, then they are artifacts and you shouldn't use them as supporting evidence for a variant call.

Geraldine Van der Auwera, PhD

• Cambridge, UKMember Posts: 111 ✭✭✭

I do recall one group that did something similar added a random 3 base barcode tag to their templates prior to amplification. The random tag was then removed after sequencing and turned into a BAM tag could then be used to distinguish the PCR dups from those that were dups by chance. This required a special version of the mark-dup tool though.

Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University

• Warwick University, CoventryMember Posts: 18

I was just wondering what the ramifications are of using "samtools rmdup" instead of MarkDuplicates in Picard? The specific reason is that I have a pipeline where a number of the steps are carried out in samtools with the variants called in VarScan. It would be inconvenient at this stage to do variant calling in a program other than VarScan; because we have our own program that interprets the VarScan output (although it probably should have been written to understand VCF from the start). Anyway, I'm now interested in using GATK's local re-alignment around indels and possibly GATK's base recalibration before calling variants in VarScan. I'm just wondering if that is safe with BAM files processed with "samtools rmdup" (after "samtools sort") instead of Picard's MarkDuplicates.

• Warwick University, CoventryMember Posts: 18

I probably posted this in the wrong thread - I noticed that there's another short thread comparing "samtools rmdup" and Picard's MarkDuplicates. Anyway, another issue is that VarScan needs output from "samtools mpileup" which in turn may not understand "marked duplicates" from Picard; so maybe I have to use "samtools rmdup" for that reason.

@WVNicholson We don't recommend samtools rmdup because it is not as thorough at marking duplicates as Picard MarkDuplicates. For the rest, you should be able to run any combination of these tools; GATK tools will take any valid bam file as input, and I expect samtools mpileup will do the same. We and the samtools organization are on the same wavelength when it comes to standardizing inputs and outputs.

If it's at all an option I would encourage you to switch to GATK for variant calling and simply find or write a converter to convert the VCF to whatever VarScan output looks like... but I'm obviously biased

Geraldine Van der Auwera, PhD

• Cambridge, UKMember Posts: 111 ✭✭✭
edited March 2016

@WVNicholson The Samtools maintainers do not recommend the use of samtools rmdup, we kept it around because people occasionally need to duplicate old pipelines. Basically rmdup only considers each read in isolation for determining duplicate status and I still haven't got around to finishing my fix for it. If you want speed, I recommend biobambam which was designed to mark the same duplicates as Picard. Also samtools mpileup understands Picard marked duplicates perfectly well, it's a standard SAM flag.

Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University