
The importance of sorting MarkDuplicates output files

I discovered that several output files from MarkDuplicates (Picard) were passed on in the Best Practices workflow (GATK 4.1) without sorting, through BaseRecalibrator, BQSR (gatk ApplyBQSR) and HaplotypeCaller, to make g.vcf files for each sample. There were no obvious problems or error messages. The plan is to combine these into a common VCF file for variant calling. Do I need to coordinate-sort these samples after MarkDuplicates and then rerun BaseRecalibrator, ApplyBQSR and HaplotypeCaller to avoid errors?

Answers

  • Tiffany_at_Broad, Cambridge, MA (Member, Administrator, Broadie, Moderator)

    Hi @flcomplex2016, my apologies if I do not fully understand your question, but yes, you probably want to run through those steps for all of your samples so that they are processed similarly. Input SAM or BAM files for MarkDuplicates must be coordinate-sorted.

  • flcomplex2016, Norway (Member)

    Thanks. The reason I asked the question is that after the BWA alignment the files are sorted with "samtools sort". I then ran MarkDuplicates on files that were already sorted, and thought it might not be necessary to sort again after marking duplicates. For all new samples it is of course not a problem to do the sorting; it just takes a lot of time and CPU time to pick up a number of old samples and reprocess them after the MarkDuplicates step.

    (Sheila also replied earlier to a question I had about combining g.vcf files processed with GATK 3.8 and GATK 4.1. As I understood her answer, she indicated that it could be OK if the g.vcf step was done with HaplotypeCaller in GATK4. In that case the samples would not be processed in exactly the same way.)

  • Tiffany_at_Broad, Cambridge, MA (Member, Administrator, Broadie, Moderator)

    Hi @flcomplex2016, I see what you are saying. I want to correct my earlier statement about MarkDuplicates only taking coordinate-sorted SAM or BAM files. Part of the tool doc doesn't appear to be updated and we will get that fixed. This may be something you want to consider, since 'samtools sort' does coordinate sorting.

    The program can take either coordinate-sorted or query-sorted inputs; however, the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. When the input is query-sorted (actually query-grouped), unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.
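
Since the behavior described above depends on the declared sort order, it can help to look at the `@HD` line of the file header (for example, the text printed by `samtools view -H`). The following is a minimal sketch, not part of GATK or Picard: the function name `read_sort_tags` and the example header are made up for illustration, and it only reads the `SO` (sort order) and `GO` (group order) tags defined in the SAM specification.

```python
def read_sort_tags(header_text):
    """Return (SO, GO) from the @HD line of a SAM header.

    A tag that is absent is returned as None. `read_sort_tags` is a
    hypothetical helper for illustration, not a GATK/Picard function.
    """
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            tags = dict(
                field.split(":", 1)
                for field in line.split("\t")[1:]
                if ":" in field
            )
            return tags.get("SO"), tags.get("GO")
    return None, None


# Made-up header text for illustration only.
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
sort_order, group_order = read_sort_tags(example_header)
print(sort_order, group_order)
```

Note that this only reports what the header claims; it does not inspect the records themselves.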

  • flcomplex2016, Norway (Member)

    Thanks again. I am sorry, but I still do not fully understand this. At the moment I have the feeling that we are discussing two different things. My initial concern was the sorting AFTER MarkDuplicates (since you can either run MarkDuplicatesSpark or MarkDuplicates + SortSam; see "Main step: MarkDuplicates: Tools involved": https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165). I interpret your comments as being about the input TO MarkDuplicates?
    In my case the input file to MarkDuplicates (after alignment to the reference genome) WAS sorted by samtools sort (= coordinate-sorted?), then MarkDuplicates was run, but the files were NOT sorted again after MarkDuplicates (as suggested in the instructions). So the missing final sort after MarkDuplicates was what made me worry.

  • Tiffany_at_Broad, Cambridge, MA (Member, Administrator, Broadie, Moderator)

    Yes, I understand your question. When running MarkDuplicates, I see an info field that says "MarkDuplicates Output will not be re-sorted. Output header will state SO:unknown GO:query".
    So you probably should use SortSam afterwards (HaplotypeCaller requires sorted reads). In addition, samtools sort doesn't change the header to indicate that the file was sorted, while SortSam does.

    My earlier post was to correct my statement and make you aware of the difference in behavior since query-sorted is more stringent.
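
Because a header can state `SO:unknown` even when the records happen to be in coordinate order, one way to settle the question for an existing file is to check the records directly. The sketch below is an illustration under stated assumptions, not a GATK or Picard tool: it works on SAM text (such as the output of `samtools view -h`), uses the reference order given by the `@SQ` lines, and treats records without a reference (`*`) as sorting last. The names `is_coordinate_sorted` and `make_record` are hypothetical.

```python
def is_coordinate_sorted(sam_text):
    """Return True if alignment records are ordered by (@SQ order, POS).

    Hypothetical helper for illustration. Records whose RNAME is "*"
    (no reference) are expected after all records placed on a reference.
    """
    ref_index = {}   # reference name -> position in the @SQ list
    last_key = None
    for line in sam_text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@SQ"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        ref_index[field[3:]] = len(ref_index)
            continue
        cols = line.split("\t")
        rname, pos = cols[2], int(cols[3])
        # Unknown or "*" references sort after every listed reference.
        key = (ref_index.get(rname, len(ref_index)), pos)
        if last_key is not None and key < last_key:
            return False
        last_key = key
    return True


def make_record(name, rname, pos):
    """Build a minimal, made-up SAM record line for the example below."""
    return "\t".join([name, "0", rname, str(pos), "60", "4M", "*", "0", "0", "ACGT", "FFFF"])


# Made-up data: the header says SO:unknown, but the records are in order.
example_sam = "\n".join([
    "@HD\tVN:1.6\tSO:unknown\tGO:query",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:1000",
    make_record("r1", "chr1", 10),
    make_record("r2", "chr1", 25),
    make_record("r3", "chr2", 5),
])
print(is_coordinate_sorted(example_sam))
```

For real BAM files this check would need the records streamed from a tool or library rather than held in a string; the ordering test itself stays the same.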

  • Tiffany_at_Broad, Cambridge, MA (Member, Administrator, Broadie, Moderator)
    edited August 20

    Alright, I followed up with a developer and they said you shouldn't have to run SortSam after MarkDuplicates if you sorted before. I mentioned the behavior difference based on sorting. Sorry for the back and forth and any confusion I caused.

  • flcomplex2016, Norway (Member)
    Accepted Answer

    Thank you very much for following this up so thoroughly and responding again. Good to hear that SortSam after MarkDuplicates is not needed when samtools sort is used after alignment. This saves a lot of work and time re-processing a number of samples through all the Best Practices steps after MarkDuplicates :-)

  • Tiffany_at_Broad, Cambridge, MA (Member, Administrator, Broadie, Moderator)

    You're welcome!
