
MarkDuplicates Queue extension and intermediate files

pdexheimer (Posts: 298; Member, GSA Collaborator)

I was frustrated by the .metrics file from MarkDuplicates getting deleted as an intermediate file, so I set isIntermediate=false for that step in the DataProcessingPipeline. But now I'm getting tired of manually deleting the intermediate bams.

So my request is: could the metrics file field be changed from an @Output to an @Argument? This would be on line 50 of org.broadinstitute.sting.queue.extensions.picard.MarkDuplicates.scala. I also made it a required field in my local copy, since it is required to run the Picard tool.

A similar but opposite problem is that the bai file from the IndelRealigner step is not deleted - but that looks like it would require either special handling for that walker in Queue or for the index file to be an argument to the Java walker. Neither is a particularly appealing solution.
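
The distinction this request relies on can be sketched with a small, self-contained model. This is not Queue code: `FieldKind`, `DeclaredFile`, `Step` and `filesToDelete` are hypothetical names invented for illustration, and the cleanup rule (only files declared as outputs of an intermediate step are deleted) is an assumption taken from the behaviour described in this thread.

```scala
import java.io.File

// Hypothetical stand-ins for Queue's field annotations; these are NOT the real classes.
sealed trait FieldKind
case object OutputField extends FieldKind   // plays the role of @Output
case object ArgumentField extends FieldKind // plays the role of @Argument

final case class DeclaredFile(kind: FieldKind, file: File)

// Toy model of a pipeline step with a single, step-wide intermediate flag.
final case class Step(name: String, isIntermediate: Boolean, files: Seq[DeclaredFile]) {
  // Assumption: only files declared as outputs of an intermediate step are cleaned up.
  def filesToDelete: Seq[File] =
    if (isIntermediate) files.collect { case DeclaredFile(OutputField, f) => f }
    else Seq.empty
}

object MetricsFieldDemo {
  val bam = new File("sample.dedup.bam")
  val metrics = new File("sample.dedup.metrics")

  // Metrics declared as an output: deleted together with the BAM.
  val metricsAsOutput = Step("MarkDuplicates", isIntermediate = true,
    Seq(DeclaredFile(OutputField, bam), DeclaredFile(OutputField, metrics)))

  // Metrics declared as a plain argument: invisible to the cleanup, so it persists.
  val metricsAsArgument = Step("MarkDuplicates", isIntermediate = true,
    Seq(DeclaredFile(OutputField, bam), DeclaredFile(ArgumentField, metrics)))

  def main(args: Array[String]): Unit = {
    println(metricsAsOutput.filesToDelete.map(_.getName).mkString(", "))
    println(metricsAsArgument.filesToDelete.map(_.getName).mkString(", "))
  }
}
```

In this model a file that is never declared at all is likewise never deleted, which mirrors the leftover bai file from the IndelRealigner step.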

Answers

  • Carneiro (Posts: 271; Administrator, GSA Member)

    The data processing pipeline is really only intended as a reference for users to write their own pipeline. It was written many years ago, is not maintained, and does not necessarily reflect our best practices. That being said, you are more than welcome to make changes and use it however you want. In particular, I think that the changes you suggest here are very sensible.

  • pdexheimer (Posts: 298; Member, GSA Collaborator)

    Thanks, but the problem is that the changes I'm talking about are in the Queue Picard extensions, not in DPP itself. I can work around it (my copy of DPP now includes a MyMarkDuplicates class), but I thought this particular change, while making things more convenient for me, would also help other people trying to use this extension.

  • Carneiro (Posts: 271; Administrator, GSA Member)

    You're right about the metrics file being required; I've changed it in MarkDuplicates.scala.

    But why do you want to switch from @Output to @Argument?

    If you don't want the metrics file to be deleted, just add an isIntermediate = false to the mark duplicates class.

  • pdexheimer (Posts: 298; Member, GSA Collaborator)

    The problem is that I only want certain outputs to be intermediate. In this specific case of DPP, I'm running clean-dedup-recal in that order. I want the bams that MarkDuplicates produces to be intermediate, but I want the metrics file to be permanent. With the system as it is now, I can either clear isIntermediate and delete the intermediate bams manually, or set isIntermediate and lose the duplication metrics.

    So setting the metrics file as an @Argument is really just a hack to allow it to persist even when MarkDuplicates has isIntermediate set. Without a more granular intermediate system, it's the only solution I could come up with (although maybe the correct approach is to have an extra field in the @Output metadata that could override intermediate-ness). In the bigger picture, I think most use cases for that file involve manual review (as opposed to another step in a Queue pipeline), though I could certainly be wrong about that. But even if I am, this file is much smaller than most of the .out files, so it's probably not going to hurt anything to have it persist.
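
The "extra field in the @Output metadata" idea from the last reply could look roughly like the sketch below. It is a self-contained toy model rather than Queue code: `OutputDecl`, `StepWithOverrides` and the `keep` flag are hypothetical names, and no such field is known to exist on Queue's @Output annotation.

```scala
import java.io.File

// Hypothetical per-output metadata; `keep` is an invented override, not a real @Output field.
final case class OutputDecl(file: File, keep: Boolean = false)

// Toy model of a step whose intermediate flag can be overridden per output.
final case class StepWithOverrides(name: String, isIntermediate: Boolean, outputs: Seq[OutputDecl]) {
  // Outputs of an intermediate step are deleted unless they are individually marked `keep`.
  def filesToDelete: Seq[File] =
    if (isIntermediate) outputs.filterNot(_.keep).map(_.file) else Seq.empty

  // Everything that survives cleanup.
  def filesToKeep: Seq[File] = outputs.map(_.file).diff(filesToDelete)
}

object KeepOverrideDemo {
  // The BAM is intermediate, while the metrics file is pinned as permanent.
  val dedup = StepWithOverrides("MarkDuplicates", isIntermediate = true, Seq(
    OutputDecl(new File("sample.dedup.bam")),
    OutputDecl(new File("sample.dedup.metrics"), keep = true)))

  def main(args: Array[String]): Unit = {
    println("delete: " + dedup.filesToDelete.map(_.getName).mkString(", "))
    println("keep:   " + dedup.filesToKeep.map(_.getName).mkString(", "))
  }
}
```

Compared with redeclaring the metrics file as an @Argument, a flag of this kind would leave the file visible as an output of the step, at the cost of a change to the annotation itself.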
