What's doing the MarkDuplicates?

cristianro87cristianro87 ArgentinaPosts: 10Member
edited March 2014 in Ask the GATK team

Hello,

i have a ION PGM sequencing, i follow the best practices to do the variant calling

my command line:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:IONTORRENT\tLB:lib1\tPU:unit1' /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa 1.fq.gz > 1_aligned_reads.sam

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/SortSam.jar INPUT=1_aligned_reads.sam OUTPUT=1_sorted_reads.bam SORT_ORDER=coordinate

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/MarkDuplicates.jar INPUT=1_sorted_reads.bam OUTPUT=1_dedup_reads.bam METRICS_FILE=1_metrics.txt

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/BuildBamIndex.jar INPUT=1_dedup_reads.bam

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_target_intervals.list

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T IndelRealigner -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -targetIntervals 1_target_intervals.list -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_realigned_reads.bam

after the IndelRealigner step, i check the sorted_reads.bam and the bam i get with the IndelRealigner on IGV...

in the position that show the image, after the realignment only 5 reads are keep, the question is why all the reads that have the variant in the reverse strand are gone?

i don't understend, these reads are placed somewhere else in the alignment?

gatk.png
2038 x 1100 - 50K
Post edited by cristianro87 on
Tagged:

Best Answer

Answers

  • cristianro87cristianro87 ArgentinaPosts: 10Member
    edited March 2014

    i check, and actually in the dedup_reads.bam file looks like the realigned_reads.bam, so this reads are filtered in the MarkDuplicates step.. but i can't see why, if they have a variant actually

    Post edited by cristianro87 on
  • TechnicalVaultTechnicalVault Cambridge, UKPosts: 111Member ✭✭✭

    The reads that seem to have been filtered do rather look like PCR duplicates. Was this a very low diversity custom pulldown library by any chance?

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University

  • cristianro87cristianro87 ArgentinaPosts: 10Member

    Dear TechnicalVault, thanks for your answer,
    i understand that PCR products must be cleaned, but why the MarkDuplicates algorithm keeps as a good variant the red variant 'T' observed in a single read in the alignment, and not the 'A' observed more than once

    thanks in advance

    Cristian

    gatk2.png
    2038 x 1100 - 51K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    Hi Cristian,

    MarkDuplicates does not look at variants or how often they are represented. It only looks at the starting position and the CIGAR string of the reads. If the start position and CIGAR are identical, then the reads are duplicates. So the program only leaves one read untouched and marks the others so they can be ignored in downstream analyses. If the multiple observations of 'A' are on duplicate reads, then they are artifacts and you shouldn't use them as supporting evidence for a variant call.

    Geraldine Van der Auwera, PhD

  • TechnicalVaultTechnicalVault Cambridge, UKPosts: 111Member ✭✭✭

    I do recall one group that did something similar added a random 3 base barcode tag to their templates prior to amplification. The random tag was then removed after sequencing and turned into a BAM tag could then be used to distinguish the PCR dups from those that were dups by chance. This required a special version of the mark-dup tool though.

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 16Member

    I was just wondering what the ramifications are of using "samtools rmdup" instead of MarkDuplicates in Picard? The specific reason is that I have a pipeline where a number of the steps are carried out in samtools with the variants called in VarScan. It would be inconvenient at this stage to do variant calling in a program other than VarScan; because we have our own program that interprets the VarScan output (although it probably should have been written to understand VCF from the start). Anyway, I'm now interested in using GATK's local re-alignment around indels and possibly GATK's base recalibration before calling variants in VarScan. I'm just wondering if that is safe with BAM files processed with "samtools rmdup" (after "samtools sort") instead of Picard's MarkDuplicates.

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 16Member

    I probably posted this in the wrong thread - I noticed that there's another short thread comparing "samtools rmdup" and Picard's MarkDuplicates. Anyway, another issue is that VarScan needs output from "samtools mpileup" which in turn may not understand "marked duplicates" from Picard; so maybe I have to use "samtools rmdup" for that reason.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    @WVNicholson We don't recommend samtools rmdup because it is not as thorough at marking duplicates as Picard MarkDuplicates. For the rest, you should be able to run any combination of these tools; GATK tools will take any valid bam file as input, and I expect samtools mpileup will do the same. We and the samtools organization are on the same wavelength when it comes to standardizing inputs and outputs.

    If it's at all an option I would encourage you to switch to GATK for variant calling and simply find or write a converter to convert the VCF to whatever VarScan output looks like... but I'm obviously biased :)

    Geraldine Van der Auwera, PhD

  • TechnicalVaultTechnicalVault Cambridge, UKPosts: 111Member ✭✭✭
    edited March 9

    @WVNicholson The Samtools maintainers do not recommend the use of samtools rmdup, we kept it around because people occasionally need to duplicate old pipelines. Basically rmdup only considers each read in isolation for determining duplicate status and I still haven't got around to finishing my fix for it. If you want speed, I recommend biobambam which was designed to mark the same duplicates as Picard. Also samtools mpileup understands Picard marked duplicates perfectly well, it's a standard SAM flag.

    Post edited by TechnicalVault on

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute and Genetic Epidemiology Group - WTSI & Cambridge University

Sign In or Register to comment.