Removing intermediate files during Cromwell runtime

thedam (Barcelona, Member)

I'm running a typical "GATK best practices" pipeline using Cromwell + WDL on my LOCAL server.
I have many intermediate files (sam, sorted.bam, recalibrated.bam, rmdup.bam) that take disk space.
With some inelegant tricks I try to squeeze as many commands as possible into one task, for example:

command {
    bwa mem -M -t 40 -R "@RG\tID:${sampleName}\tSM:${sampleName}\tPL:ILLUMINA\tLB:lib1\tPU:unit" ${REF} ${fastq1} ${fastq2} > ${sampleName}.sam;

    samtools sort [email protected] -O BAM -o sorted.bam ${sampleName}.sam && samtools index [email protected] sorted.bam;

    rm ${sampleName}.sam
}

But the intermediate .bam files still remain and I have to remove them manually, which destroys the whole beauty of an "automatic pipeline". Also, at the end I need to move/copy the final .bam files to my preferred location, which takes time and is again manual work.

Is there any solution/approach in WDL + Cromwell to run tasks like this:

bwa -> out.sam
out.sam -> sort -> sorted.bam
rm out.sam
sorted.bam -> MarkDuplicates -> marked.bam
rm sorted.bam
marked.bam -> BaseRecalibrator -> recalibrated.bam
rm marked.bam

So every unnecessary, big .bam file is removed immediately.

P.S. I know there is a similar thread here: https://gatkforums.broadinstitute.org/wdl/discussion/9818/deleting-intermediate-files but I didn't find a satisfactory solution in it; also, a year has passed since then, so maybe something has changed...

Thanks for any help
Damian

Answers

  • danb (Member, Broadie)

    Hi @thedam, assuming you are also using the local filesystem:

    We don't have a cleanup phase per se. However:

    To localize inputs we use a strategy of symlink -> hardlink -> copy, falling back in that order. Can you make sure that the files you are seeing in the inputs directory are indeed copies? Also, are you seeing any warnings that read roughly "[sym/hard] link failed"?

    NB: you can check the number of references to a hard-linked inode via ls -l

  • thedam (Barcelona, Member)

    Hi @danb,
    it's not about the inputs but about the actual outputs. Take this example pipeline:

    input -> TASK1 -> bam1
    bam1 -> TASK2 -> bam2
    bam2 -> TASK3 -> bam3
    bam3 -> TASK4 -> bam4

    The bams on the right side (the outputs) are the big files (~10GB each).
    The bams on the left (the inputs) are symlinks, so they are not the problem.

    I'd like to remove bam1, bam2 and bam3 as soon as each one is no longer needed.

    This trick doesn't work:

    task TASK2 {
        File inputFromTASK1

        command {
            someProgram -i ${inputFromTASK1} -o bam2;
            rm ${inputFromTASK1}
        }

        output {
            File outputTASK2 = "bam2"
        }
    }
    

    I don't need 40GB of output per patient, just 10GB. When the pipeline runs for ~30 patients, the output is about 30*40GB = 1200GB instead of 300GB.

    Cheers

  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    That trick probably isn't working because the File input is re-localized for TASK2, and it's the re-localized version that's being deleted.

    Until Cromwell adds a clean-up feature, what I'd suggest trying is changing that File input to a String input (with the proviso that this makes the WDL less portable). That way you'll be given the path to the original file, rather than a re-localized copy.
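
    A minimal sketch of that change, reusing the placeholder names from the task posted above (someProgram, inputFromTASK1, bam2) and the same WDL syntax. It is untested and assumes the local backend with every task running on one shared filesystem, because a String is only a path and Cromwell will not localize it. The call in the workflow should be able to stay as it is, on the assumption that Cromwell coerces the File output of TASK1 to a String:

    ```wdl
    task TASK2 {
        # String instead of File: the task receives the path of the original
        # TASK1 output, not a re-localized link or copy in its inputs directory.
        String inputFromTASK1

        command {
            # Remove the upstream BAM only if someProgram exited successfully.
            someProgram -i ${inputFromTASK1} -o bam2 && rm ${inputFromTASK1}
        }

        output {
            File outputTASK2 = "bam2"
        }
    }
    ```

    Chaining with && rather than ; keeps the upstream file in place when someProgram fails. Keep in mind that once the file is deleted, anything that later needs the TASK1 output (a restart of the workflow, or another task reading the same file) will no longer find it.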
