
Incremental files between WDL calls

Hi,

Inspired by a post from @Vzzarr (https://gatkforums.broadinstitute.org/gatk/discussion/10529/how-to-run-the-entire-pipeline-using-even-spark-tools-from-java#latest), I would like to ask whether it is possible to write only the changes (deltas) to a file as an output, the way common backup solutions do. The aim is to reduce the time, space, and I/O operations spent writing intermediate files during a WDL workflow.

As an example, consider the pre-processing part of the pipeline, which creates several BAM files; each BAM file changes only slightly at every step in terms of the information it contains.

Greetings EADG


Answers

  • kshakir Broadie, Dev

    I'm not sure I 100% follow the proposed workflow, but you may get 80% of what is described by using the GATK4 GCS NIO features, which support streaming BAM data. That way the workflow would save on localizing the input BAM data during each step of the pipeline, especially if scattering were used to cover multiple genomic locations in parallel.
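    As a minimal sketch of how that looks in a WDL task (tool choice and bucket path are hypothetical): declaring the BAM input as a String rather than a File keeps Cromwell from localizing it, and GATK4 reads the gs:// path directly over NIO.

    ```wdl
    task CountReadsStreamed {
      # String, not File: Cromwell will not download the BAM to the worker;
      # GATK4 streams it from GCS via NIO instead.
      String bam_gs_path  # e.g. "gs://my-bucket/sample.bam" (hypothetical)

      command {
        gatk CountReads -I ${bam_gs_path}
      }
      runtime {
        docker: "broadinstitute/gatk:4.0.0.0"
      }
      output {
        String counts = read_string(stdout())
      }
    }
    ```

    The same String-path trick works with a scatter over genomic intervals, where each shard streams only the region it needs.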

    Perhaps each "delta" being written could just be a new call output, then chained/passed to a downstream call as a new input.
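    That chaining pattern might be sketched like this in WDL (task names, the `produce_delta`/`apply_delta` tools, and file names are all hypothetical): each step emits its delta as an output, and the next call consumes it as an input.

    ```wdl
    workflow DeltaChain {
      File base_bam

      call StepOne { input: bam = base_bam }
      call StepTwo { input: bam = base_bam, delta = StepOne.delta }
    }

    task StepOne {
      File bam
      command {
        # hypothetical tool that writes only the changes, not a full BAM
        produce_delta --in ${bam} --out step1.delta
      }
      output { File delta = "step1.delta" }
    }

    task StepTwo {
      File bam
      File delta
      command {
        # hypothetical tool that reconstructs state from base + delta
        apply_delta --in ${bam} --delta ${delta} --out step2.delta
      }
      output { File delta = "step2.delta" }
    }
    ```

    Cromwell infers the StepOne → StepTwo dependency from the output/input chaining, so only the small delta files move between calls rather than full intermediate BAMs.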

    Beyond something like that (again, if I'm following correctly), mutating input files is not currently a feature of Cromwell. Still, one could imagine workarounds built on existing external storage techniques, with varying levels of complexity. Depending on the project and how live the data needs to be, one could use a generic Spark RDD, MySQL, MongoDB, BigQuery, etc., or something more specialized like Hail or GenomicsDB.
