Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on GitHub.

Very slow download of Files from Google Cloud Storage

We're using Cromwell with Google Cloud Storage and Google's Pipelines API, and have observed that transferring files to GCS once a task outputs its files is extremely fast (~13 seconds for 978 files). By contrast, transferring those files to a new task (and its associated new VM) is extremely slow, about 532 seconds, which appears to be due to the way Cromwell copies files from GCS: it issues a separate gsutil cp command for each and every file.

An example copy command of a single file:

sudo gsutil -q -m cp gs://test-bucket/wdl_runner/work/cs/16738fac-5146-4a3c-9cfa-d5ded7f199fc/call-demultiplex_and_sample_prep/glob-9c1244b6ebf22abec57cd494340f8c79/CL101_invASISTR_segment_0.fasta /mnt/local-disk/test-bucket/wdl_runner/work/cs/16738fac-5146-4a3c-9cfa-d5ded7f199fc/call-demultiplex_and_sample_prep/glob-9c1244b6ebf22abec57cd494340f8c79/CL101_invASISTR_segment_0.fasta

The -m flag for performing a multi-threaded copy is enabled, which is great, but it has no effect since each command copies only a single file. Is there any way to change the copy command so that it downloads an entire bucket (or prefix) in one invocation? Or some other way to make the file transfer more efficient?
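One way the per-file pattern could be batched (a sketch, not something Cromwell supports out of the box; the bucket paths below are made up) is to collect all the source URIs and hand them to a single gsutil invocation, e.g. via `cp -I`, which reads URIs from stdin so that `-m` can actually parallelize across them. Shown here as a dry run that only prints what would be copied:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical list of per-file URIs that would otherwise be copied
# one invocation at a time.
uris="gs://test-bucket/dir/a.fasta
gs://test-bucket/dir/b.fasta
gs://test-bucket/dir/c.fasta"

# Dry run: show what a single batched invocation would cover.
echo "$uris" | while read -r uri; do
  echo "would copy $uri"
done

# The real batched command would be a single multi-threaded copy:
#   printf '%s\n' "$uris" | gsutil -q -m cp -I /mnt/local-disk/dest/
```

The point is that one process with one stdin-fed URI list amortizes gsutil's startup cost and lets `-m` fan out the transfers, instead of paying that cost once per file.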


  • Thib — Cambridge; Member, Broadie, Dev

    Unfortunately, at the moment this gsutil command is an artifact of the Pipelines API, and Cromwell has no control over it.
    However, version 2 of the API, which we're going to start supporting imminently, will give us control over localization and possible performance optimizations.
    For now, you could try zipping the files, or use something like NIO in your tool to read only parts of the file, if that's possible / makes sense.

  • edpark_clearlabs — Member

    Hi Thib, thanks for the reply.

    Would there even be a way to modify the Cromwell source ourselves in a branch so that we could override this behavior?
    Do you happen to know when version 2 is scheduled for release?

  • edpark_clearlabs — Member
    edited April 2018

    Just FYI: to get around this problem we've resorted to a hack where we send only a single File to a task from the previous task, pull out its directory, and then use gsutil cp directly to copy an entire GCS prefix to a local folder. Something like this:

    task foo_task {
      File file
      command <<<
        # Remove the leading /cromwell_root prefix
        FILE=$(echo ${file} | sed -e "s/\/cromwell_root\///g")
        # Obtain the path, which mirrors the path to the files on GCS
        DIR=$(dirname $FILE)
        # Create a directory to hold the files to be copied
        mkdir -p /cromwell_root/stage_folder
        # Copy the files from GCS to the local VM in one recursive command
        time gsutil -q -m cp -R gs://$DIR /cromwell_root/stage_folder
      >>>
    }

    Hopefully that helps someone facing the same problem.
    It's much faster to copy the files from GCS in this manner.
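    The path trick above can be isolated into a small, runnable sketch (the localized path below is illustrative, not from a real run): given one localized file path, recover the GCS prefix that holds all of its sibling files.

    ```shell
    #!/usr/bin/env bash
    set -euo pipefail

    # One localized input file, as Cromwell would stage it under /cromwell_root
    localized="/cromwell_root/test-bucket/wdl_runner/work/glob-abc123/sample_0.fasta"

    # Strip the /cromwell_root/ mount prefix to get the bucket-relative path
    rel="${localized#/cromwell_root/}"

    # Its directory mirrors the GCS prefix containing every sibling file
    prefix="gs://$(dirname "$rel")"

    echo "$prefix"

    # A single recursive copy would then fetch the whole prefix at once:
    #   gsutil -q -m cp -R "$prefix" /cromwell_root/stage_folder
    ```

    This replaces N single-file gsutil invocations with one multi-threaded recursive copy, which is where the speedup reported above comes from.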
