What is MarkDuplicates doing?

cristianro87 | Argentina | Posts: 5 | Member
edited March 28 in Ask the GATK team

Hello,

I have Ion PGM sequencing data, and I am following the Best Practices for variant calling.

My command lines:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:IONTORRENT\tLB:lib1\tPU:unit1' /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa 1.fq.gz > 1_aligned_reads.sam

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/SortSam.jar INPUT=1_aligned_reads.sam OUTPUT=1_sorted_reads.bam SORT_ORDER=coordinate

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/MarkDuplicates.jar INPUT=1_sorted_reads.bam OUTPUT=1_dedup_reads.bam METRICS_FILE=1_metrics.txt

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/BuildBamIndex.jar INPUT=1_dedup_reads.bam

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_target_intervals.list

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T IndelRealigner -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -targetIntervals 1_target_intervals.list -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_realigned_reads.bam

After the IndelRealigner step, I compared sorted_reads.bam with the BAM produced by IndelRealigner in IGV.

At the position shown in the image, only 5 reads are kept after realignment. Why are all the reads that carry the variant on the reverse strand gone?

I don't understand. Are these reads placed somewhere else in the alignment?

[Attached image: gatk.png]

Answers

  • cristianro87 | Argentina | Posts: 5 | Member
    edited March 28

    I checked, and the dedup_reads.bam file actually looks like realigned_reads.bam, so these reads are filtered at the MarkDuplicates step. But I can't see why, since they actually carry a variant.

  • TechnicalVault | Sanger, Cambridge, UK | Posts: 81 | Member

    The reads that seem to have been filtered do rather look like PCR duplicates. Was this a very low diversity custom pulldown library by any chance?

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • cristianro87 | Argentina | Posts: 5 | Member

    Dear TechnicalVault, thanks for your answer. I understand that PCR duplicates must be removed, but why does the MarkDuplicates algorithm keep as a good variant the red 'T' observed in a single read in the alignment, and not the 'A' observed more than once?

    Thanks in advance,

    Cristian

    [Attached image: gatk2.png]
  • Geraldine_VdAuwera | Posts: 6,682 | Administrator, GATK Developer

    Hi Cristian,

    MarkDuplicates does not look at variants or how often they are represented. It only looks at the starting position and the CIGAR string of the reads. If the start position and CIGAR are identical, then the reads are duplicates. So the program only leaves one read untouched and marks the others so they can be ignored in downstream analyses. If the multiple observations of 'A' are on duplicate reads, then they are artifacts and you shouldn't use them as supporting evidence for a variant call.

    Geraldine Van der Auwera, PhD

  • TechnicalVault | Sanger, Cambridge, UK | Posts: 81 | Member

    I do recall one group doing something similar that added a random 3-base barcode tag to their templates prior to amplification. The random tag was removed after sequencing and turned into a BAM tag, which could then be used to distinguish the PCR dups from those that were dups by chance. This required a special version of the mark-dup tool, though.

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute
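
For readers who want to see the logic discussed in this thread, here is a minimal Python sketch. It is not Picard's implementation. It groups reads by start position and CIGAR as Geraldine describes, and optionally adds a barcode to the grouping key, as in the approach Martin recalls. The `Read` record, the `mark_duplicates` function, the use of strand in the key, and keeping the read with the highest mean quality are all assumptions made for illustration; the real tool may differ in how it computes positions and which read it keeps.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record; not a Picard or GATK type."""
    name: str
    chrom: str
    start: int          # leftmost aligned position
    reverse: bool       # True if on the reverse strand (assumed part of the key)
    cigar: str
    mean_qual: float    # assumed tie-breaker for choosing which read to keep
    barcode: str = ""   # random tag added before amplification, if any


def mark_duplicates(reads, use_barcode=False):
    """Return the names of reads that would be flagged as duplicates."""
    groups = defaultdict(list)
    for read in reads:
        # Reads sharing this key are treated as copies of one molecule.
        key = (read.chrom, read.start, read.reverse, read.cigar)
        if use_barcode:
            # Different barcodes mean different original templates.
            key += (read.barcode,)
        groups[key].append(read)

    duplicates = set()
    for members in groups.values():
        # Keep one representative per group; flag the rest.
        keep = max(members, key=lambda r: r.mean_qual)
        duplicates.update(r.name for r in members if r is not keep)
    return duplicates


# Toy data loosely mirroring the thread: four reads showing 'A' share the
# same start and CIGAR, and one read showing 'T' starts elsewhere.
reads = [
    Read("a1", "chr1", 100, True, "50M", 30.0, "ACG"),
    Read("a2", "chr1", 100, True, "50M", 28.0, "ACG"),
    Read("a3", "chr1", 100, True, "50M", 27.0, "TTG"),
    Read("a4", "chr1", 100, True, "50M", 25.0, "ACG"),
    Read("t1", "chr1", 90, False, "50M", 29.0, "GGA"),
]

print(sorted(mark_duplicates(reads)))                    # ['a2', 'a3', 'a4']
print(sorted(mark_duplicates(reads, use_barcode=True)))  # ['a2', 'a4']
```

Without barcodes, the four 'A' reads collapse to a single observation, so 'A' ends up with no more support than the single 'T' read. With barcodes, read a3 is recognized as coming from a different template and survives as independent evidence.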