Picard MarkDuplicates vs samtools rmdup for variant calling with GATK

SvyatoslavSidorov · St. Petersburg, Russia · Member

Dear GATK team,

I'm going to do variant calling with GATK for several dozen samples, using the hg38 reference. I have several questions about this process. They are partially covered on the forums and in the FAQs, but I'd like to clarify some points:

1) Am I right that MarkDuplicates can process a BAM file that contains both paired-end and single-end reads? (Picard FAQ hints it can, but just to be sure.)

2) Am I right that MarkDuplicates is significantly slower than samtools rmdup (because of its algorithm that marks not only dupes from the same chromosome, but also dupes from different chromosomes)?

3) Is there any evidence that use of MarkDuplicates is significantly better for the downstream analysis with GATK than use of samtools rmdup? (Of course, MarkDuplicates is used in the Best Practices, but Picard tools are used everywhere in that guide.)

Remarks:

1) I use bowtie2 --very-sensitive for read mapping.

2) I'd like to get a gVCF file for each sample.

Issue · Github (by Sheila): Issue Number 475 · State: closed · Closed By: sooheelee

Best Answer

Answers

  • SvyatoslavSidorov · St. Petersburg, Russia · Member

    Thank you very much for the detailed answer!

  • raman91 · Singapore · Member

    Hi shlee, I was using Picard MarkDuplicates and have a question.
    I mapped my reads using BWA, added the read group, then sorted and indexed the reads. Now I am running the MarkDuplicates step, but I think there are multiple reads with the same name, and I am getting this error: Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. How should I solve this error? Thanks

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @raman91,

    Are you running the latest version of GATK? Updating to the latest version solved the issue for this user.
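
If updating does not help, it may be worth testing the suspicion raised in the question, namely that some read names occur more often than they should. The sketch below is only a diagnostic aid, not part of Picard or GATK: it assumes the exception is triggered by the same read name and mate appearing more than once as a primary alignment, and the function name `find_repeated_primary_records` is hypothetical. It reads SAM text lines, skips secondary and supplementary records using the FLAG bits defined in the SAM specification, and reports every (read name, mate) combination seen more than once.

```python
from collections import Counter

# FLAG bits from the SAM specification.
PAIRED = 0x1
READ1 = 0x40
READ2 = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def find_repeated_primary_records(sam_lines):
    """Return {(read_name, mate): count} for primary records seen more than once.

    mate is 1 or 2 for paired reads and 0 for unpaired reads.
    Hypothetical helper for illustration; not a Picard or GATK function.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # Secondary and supplementary records may legitimately repeat a name.
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if flag & PAIRED:
            mate = 1 if flag & READ1 else 2 if flag & READ2 else 0
        else:
            mate = 0
        counts[(name, mate)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Made-up records: readB's first mate appears twice as a primary alignment.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "readA\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "readA\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII",
    "readA\t2147\tchr2\t500\t60\t4M\t=\t200\t0\tACGT\tIIII",
    "readB\t99\tchr1\t300\t60\t4M\t=\t400\t104\tACGT\tIIII",
    "readB\t99\tchr1\t300\t60\t4M\t=\t400\t104\tACGT\tIIII",
    "readB\t147\tchr1\t400\t60\t4M\t=\t300\t-104\tACGT\tIIII",
]

repeated = find_repeated_primary_records(example)
print(repeated)  # {('readB', 1): 2}
```

On real data, the same function could be fed the text output of `samtools view -h` for the BAM file in question. If it reports repeated names, a likely place to look is an earlier step such as merging files or aligning the same FASTQ twice, though that is a guess rather than a confirmed diagnosis.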
