What is MarkDuplicates doing?

cristianro87 | Argentina | Posts: 10 | Member
edited March 2014 in Ask the GATK team


I have Ion PGM sequencing data and I am following the Best Practices to do the variant calling.

My command lines:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:IONTORRENT\tLB:lib1\tPU:unit1' /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa 1.fq.gz > 1_aligned_reads.sam

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/SortSam.jar INPUT=1_aligned_reads.sam OUTPUT=1_sorted_reads.bam SORT_ORDER=coordinate

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/MarkDuplicates.jar INPUT=1_sorted_reads.bam OUTPUT=1_dedup_reads.bam METRICS_FILE=1_metrics.txt

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/BuildBamIndex.jar INPUT=1_dedup_reads.bam

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_target_intervals.list

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T IndelRealigner -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -targetIntervals 1_target_intervals.list -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_realigned_reads.bam

After the IndelRealigner step, I compared sorted_reads.bam and the BAM produced by IndelRealigner in IGV.

At the position shown in the image, only 5 reads are kept after realignment. Why are all the reverse-strand reads that carry the variant gone?

I don't understand. Are these reads placed somewhere else in the alignment?


Best Answer


  • cristianro87 | Argentina | Posts: 10 | Member
    edited March 2014

    I checked, and the dedup_reads.bam file actually looks like realigned_reads.bam, so these reads are filtered at the MarkDuplicates step. But I can't see why, since they actually carry a variant.

  • TechnicalVault | Cambridge, UK | Posts: 110 | Member ✭✭✭

    The reads that seem to have been filtered do rather look like PCR duplicates. Was this a very low diversity custom pulldown library by any chance?

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University

  • cristianro87 | Argentina | Posts: 10 | Member

    Dear TechnicalVault, thanks for your answer.
    I understand that PCR products must be cleaned up, but why does the MarkDuplicates algorithm keep as a good variant the red 'T', observed in a single read in the alignment, and not the 'A', observed more than once?

    Thanks in advance.


  • Geraldine_VdAuwera | Posts: 9,347 | Administrator, Dev admin

    Hi Cristian,

    MarkDuplicates does not look at variants or how often they are represented. It only looks at the starting position and the CIGAR string of the reads. If the start position and CIGAR are identical, then the reads are duplicates. So the program only leaves one read untouched and marks the others so they can be ignored in downstream analyses. If the multiple observations of 'A' are on duplicate reads, then they are artifacts and you shouldn't use them as supporting evidence for a variant call.

    Geraldine Van der Auwera, PhD
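
The rule described in the answer above can be illustrated with a small Python sketch. This is only a simplified model, not Picard's code: the `Read` record, the `mark_duplicates` function, and the example reads are all hypothetical. Including the strand in the grouping key and keeping the read with the highest summed base quality are assumptions made for illustration; Picard's actual criteria differ in detail and are described in its documentation. The point is that the observed base never enters the grouping key, so three identical reads carrying 'A' collapse to one observation.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical minimal read record, for illustration only.
    name: str
    chrom: str
    start: int      # alignment start position
    reverse: bool   # True if the read maps to the reverse strand (assumption: strand is part of the key)
    cigar: str
    base: str       # base observed at the site of interest (NOT used for duplicate marking)
    qual_sum: int   # sum of base qualities (assumption: used to pick the read to keep)


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates under the simplified rule."""
    groups = defaultdict(list)
    for read in reads:
        # The key ignores the read sequence entirely.
        groups[(read.chrom, read.start, read.reverse, read.cigar)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.qual_sum)  # one read per group stays unmarked
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("r1", "chr1", 1000, True, "120M", "A", 3600),
    Read("r2", "chr1", 1000, True, "120M", "A", 3500),
    Read("r3", "chr1", 1000, True, "120M", "A", 3400),
    Read("r4", "chr1", 950, False, "120M", "T", 3550),
]

duplicates = mark_duplicates(reads)
kept_bases = Counter(r.base for r in reads if r.name not in duplicates)
print(sorted(duplicates))  # ['r2', 'r3']
print(dict(kept_bases))    # {'A': 1, 'T': 1}
```

In this toy example the 'A' is still observed after duplicate marking, but only once, because the three reads supporting it are indistinguishable by position, strand, and CIGAR. The single 'T' read survives simply because no other read shares its key.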

  • TechnicalVault | Cambridge, UK | Posts: 110 | Member ✭✭✭

    I do recall that one group doing something similar added a random 3-base barcode tag to their templates prior to amplification. The random tag was removed after sequencing and turned into a BAM tag, which could then be used to distinguish the PCR dups from those that were dups by chance. This required a special version of the mark-dup tool, though.

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University
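
The random-tag idea in the last comment can be sketched the same way. This is a hypothetical illustration, not the special mark-dup tool mentioned above: the `find_duplicates` function, the dictionary layout, and the `tag` field are assumptions. Adding the tag to the grouping key means reads at the same position are only treated as PCR duplicates when they also share the same random tag.

```python
from collections import defaultdict


def find_duplicates(reads, use_tag):
    """Return names of reads flagged as duplicates.

    reads: list of dicts with keys name, chrom, start, reverse, cigar, tag
           (hypothetical layout; 'tag' stands in for the random barcode stored as a BAM tag).
    use_tag: if True, the random tag is part of the grouping key.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["reverse"], read["cigar"])
        if use_tag:
            key += (read["tag"],)
        groups[key].append(read["name"])

    duplicates = set()
    for names in groups.values():
        # Keep the first read of each group; flag the rest (simplification).
        duplicates.update(names[1:])
    return duplicates


# Four reads with identical position, strand, and CIGAR, but different random tags.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "reverse": True, "cigar": "80M", "tag": "ACG"},
    {"name": "r2", "chrom": "chr1", "start": 100, "reverse": True, "cigar": "80M", "tag": "ACG"},
    {"name": "r3", "chrom": "chr1", "start": 100, "reverse": True, "cigar": "80M", "tag": "TTA"},
    {"name": "r4", "chrom": "chr1", "start": 100, "reverse": True, "cigar": "80M", "tag": "GCA"},
]

without_tag = find_duplicates(reads, use_tag=False)
with_tag = find_duplicates(reads, use_tag=True)
print(sorted(without_tag))  # ['r2', 'r3', 'r4']
print(sorted(with_tag))     # ['r2']
```

With the tag in the key, r3 and r4 are kept as independent templates that happen to share a start position, while r2 is still flagged because it shares r1's tag. A 3-base tag only has 64 possible values, so chance collisions between distinct templates remain possible.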
