The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

# What's doing the MarkDuplicates?

ArgentinaMember
edited March 2014

Hello,

i have a ION PGM sequencing, i follow the best practices to do the variant calling

my command line:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:IONTORRENT\tLB:lib1\tPU:unit1' /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa 1.fq.gz > 1_aligned_reads.sam

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_target_intervals.list

after the IndelRealigner step, i check the sorted_reads.bam and the bam i get with the IndelRealigner on IGV...

in the position that show the image, after the realignment only 5 reads are keep, the question is why all the reads that have the variant in the reverse strand are gone?

i don't understend, these reads are placed somewhere else in the alignment?

Tagged:

• ArgentinaMember
edited March 2014

i check, and actually in the dedup_reads.bam file looks like the realigned_reads.bam, so this reads are filtered in the MarkDuplicates step.. but i can't see why, if they have a variant actually

• Cambridge, UKMember

The reads that seem to have been filtered do rather look like PCR duplicates. Was this a very low diversity custom pulldown library by any chance?

• ArgentinaMember

i understand that PCR products must be cleaned, but why the MarkDuplicates algorithm keeps as a good variant the red variant 'T' observed in a single read in the alignment, and not the 'A' observed more than once

Cristian

Hi Cristian,

MarkDuplicates does not look at variants or how often they are represented. It only looks at the starting position and the CIGAR string of the reads. If the start position and CIGAR are identical, then the reads are duplicates. So the program only leaves one read untouched and marks the others so they can be ignored in downstream analyses. If the multiple observations of 'A' are on duplicate reads, then they are artifacts and you shouldn't use them as supporting evidence for a variant call.

• Cambridge, UKMember

I do recall one group that did something similar added a random 3 base barcode tag to their templates prior to amplification. The random tag was then removed after sequencing and turned into a BAM tag could then be used to distinguish the PCR dups from those that were dups by chance. This required a special version of the mark-dup tool though.

• Warwick University, CoventryMember

I was just wondering what the ramifications are of using "samtools rmdup" instead of MarkDuplicates in Picard? The specific reason is that I have a pipeline where a number of the steps are carried out in samtools with the variants called in VarScan. It would be inconvenient at this stage to do variant calling in a program other than VarScan; because we have our own program that interprets the VarScan output (although it probably should have been written to understand VCF from the start). Anyway, I'm now interested in using GATK's local re-alignment around indels and possibly GATK's base recalibration before calling variants in VarScan. I'm just wondering if that is safe with BAM files processed with "samtools rmdup" (after "samtools sort") instead of Picard's MarkDuplicates.

• Warwick University, CoventryMember

I probably posted this in the wrong thread - I noticed that there's another short thread comparing "samtools rmdup" and Picard's MarkDuplicates. Anyway, another issue is that VarScan needs output from "samtools mpileup" which in turn may not understand "marked duplicates" from Picard; so maybe I have to use "samtools rmdup" for that reason.