MarkDuplicates Queue extension and intermediate files

pdexheimer Member ✭✭✭✭

I was frustrated by the .metrics file from MarkDuplicates getting deleted as an intermediate file, so I set isIntermediate=false for that step in the DataProcessingPipeline. But now I'm getting tired of manually deleting the intermediate bams.

So my request is: could the metrics file field be changed from an @Output to an @Argument? This would be on line 50 of org.broadinstitute.sting.queue.extensions.picard.MarkDuplicates.scala. I also made it a required field in my local copy, since it is required to run the Picard tool.

A similar but opposite problem is that the bai file from the IndelRealigner step is not deleted - but that looks like it would require either special handling for that walker in Queue or for the index file to be an argument to the Java walker. Neither is a particularly appealing solution.


  • Carneiro (Charlestown, MA) Member, admin

    The data processing pipeline is really only intended as a reference for users to write their own pipeline. It was written many years ago, is not maintained, and does not necessarily reflect our best practices. That being said, you are more than welcome to make changes and use it however you want. In particular, I think that the changes you suggest here are very sensible.

  • pdexheimer Member ✭✭✭✭

    Thanks, but the problem is that the changes I'm talking about are in the Queue Picard extensions, not in DPP itself. I can work around it (my copy of DPP now includes a MyMarkDuplicates class), but I thought this particular change, while making it more convenient for me, would also help other people trying to use this extension.

  • Carneiro (Charlestown, MA) Member, admin

    You're right about the metrics file being required; I've changed it in MarkDuplicates.scala.

    But why do you want to switch from @Output to @Argument?

    If you don't want the metrics file to be deleted, just add an isIntermediate = false to the mark duplicates class.

  • pdexheimer Member ✭✭✭✭

    The problem is that I only want certain outputs to be intermediate. In this specific case of DPP, I'm running clean-dedup-recal in that order. I want the bams that MarkDuplicates produces to be intermediate, but I want the metrics file to be permanent. With the system as it is now, I can either clear isIntermediate and delete the intermediate bams manually, or set isIntermediate and lose the duplication metrics.

    So setting the metrics file as an @Argument is really just a hack to allow it to persist even when MarkDuplicates has isIntermediate set. Without a more granular intermediate system, it's the only solution I could come up with (although maybe the correct approach is to have an extra field in the @Output metadata that could override intermediate-ness). In the bigger picture, I think most use cases for that file involve manual review (as opposed to another step in a Queue pipeline), though I could certainly be wrong about that. But even if I am, this file is much smaller than most of the .out files, so it's probably not going to hurt anything to have it persist.
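
The trade-off discussed in this thread can be illustrated with a small toy model. This is only a sketch and does not use Queue's real classes: `FieldKind`, `FileField`, `Step`, the `keep` flag, and `filesToDelete` are all hypothetical names, and the cleanup rule (every output-tagged file of an intermediate step is deleted, argument-tagged files are ignored) is an assumption taken from the behaviour described above, not from Queue's source. It compares the stock setup, the "metrics as @Argument" workaround, and the suggested per-output override.

```scala
// Toy model only: none of these types are Queue's real classes.
sealed trait FieldKind
case object OutputField extends FieldKind    // stands in for @Output
case object ArgumentField extends FieldKind  // stands in for @Argument

// `keep` is a hypothetical per-field override of intermediate-ness.
final case class FileField(name: String, path: String, kind: FieldKind, keep: Boolean = false)

final case class Step(name: String, isIntermediate: Boolean, fields: Seq[FileField])

object IntermediateCleanupSketch {
  // Assumed rule: for an intermediate step, delete every output-tagged file
  // that is not marked `keep`; argument-tagged files are never deleted.
  def filesToDelete(step: Step): Seq[String] =
    if (step.isIntermediate)
      step.fields.collect { case FileField(_, path, OutputField, false) => path }
    else Seq.empty

  // Helper that rewrites only the field named "metrics".
  private def remapMetrics(step: Step)(f: FileField => FileField): Step =
    step.copy(fields = step.fields.map(x => if (x.name == "metrics") f(x) else x))

  // Stock setup: both the BAM and the metrics file are outputs.
  val stock: Step = Step("MarkDuplicates", isIntermediate = true, Seq(
    FileField("output", "sample.dedup.bam", OutputField),
    FileField("metrics", "sample.dedup.metrics", OutputField)))

  // Workaround from the thread: treat the metrics file as an argument.
  val metricsAsArgument: Step = remapMetrics(stock)(_.copy(kind = ArgumentField))

  // Alternative from the thread: keep it an output, but override intermediate-ness.
  val metricsKept: Step = remapMetrics(stock)(_.copy(keep = true))

  def main(args: Array[String]): Unit = {
    println(s"stock:             ${filesToDelete(stock)}")
    println(s"metrics as arg:    ${filesToDelete(metricsAsArgument)}")
    println(s"metrics with keep: ${filesToDelete(metricsKept)}")
  }
}
```

Under these assumptions, both variants delete only the intermediate BAM and leave the metrics file in place, while the stock setup deletes both files. The override variant has the advantage that the metrics file still counts as an output, which would matter if a later pipeline step consumed it.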
