Very slow download of files from Google Cloud Storage

We're using Cromwell with Google Cloud Storage and Google's Pipelines API, and we have observed that transferring files to GCS once a task outputs its files is extremely fast (~13 seconds for 978 files). By contrast, transferring the files to a new task (and its associated new VM) is extremely slow - about 532 seconds - which appears to be due to the way Cromwell copies files from GCS: it issues a single gsutil cp command for each and every file.

An example copy command for a single file:

sudo gsutil -q -m cp gs://test-bucket/wdl_runner/work/cs/16738fac-5146-4a3c-9cfa-d5ded7f199fc/call-demultiplex_and_sample_prep/glob-9c1244b6ebf22abec57cd494340f8c79/CL101_invASISTR_segment_0.fasta /mnt/local-disk/test-bucket/wdl_runner/work/cs/16738fac-5146-4a3c-9cfa-d5ded7f199fc/call-demultiplex_and_sample_prep/glob-9c1244b6ebf22abec57cd494340f8c79/CL101_invASISTR_segment_0.fasta

The -m flag for multi-threaded copying is enabled, which is great, but it has no effect since the command only copies a single file. Is there any way to change the copy command so that it can download an entire bucket? Or is there some other way to make the file transfer more efficient?
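
For comparison, a single recursive copy of the enclosing directory would let -m actually parallelize across all of the files. A hypothetical command, reusing the paths from the example above:

# One recursive, multi-threaded copy of the whole glob directory,
# instead of one gsutil cp invocation per file
sudo gsutil -q -m cp -R \
  gs://test-bucket/wdl_runner/work/cs/16738fac-5146-4a3c-9cfa-d5ded7f199fc/call-demultiplex_and_sample_prep/glob-9c1244b6ebf22abec57cd494340f8c79 \
  /mnt/local-disk/test-bucket/wdl_runner/work/cs/16738fac-5146-4a3c-9cfa-d5ded7f199fc/call-demultiplex_and_sample_prep/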

Answers

  • Thib (Cambridge; Member, Broadie, Dev)

    Unfortunately, at the moment this gsutil command is an artifact of the Pipelines API, and Cromwell doesn't have control over it.
    However, version 2 of the API, which we're going to start supporting imminently, will give us control over localization and possibly some performance optimizations.
    For now, you could try zipping the files, or use something like NIO in your tool to read only parts of the file, if that's possible / makes sense.
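
    For example, a task could bundle its outputs into a single archive so that downstream tasks only localize one object. A rough sketch (the task and file names here are made up):

    task bundle_outputs {
      command <<<
        # Produce the task's real outputs here; this sketch just writes dummy files
        mkdir -p out
        for i in $(seq 1 10); do echo "data" > out/file_$i.txt; done
        # Bundle everything into one archive so only a single object
        # needs to be localized by the next task
        tar -czf outputs.tar.gz -C out .
      >>>
      output {
        File archive = "outputs.tar.gz"
      }
    }

    A downstream task would then take the archive as its single File input and run tar -xzf on it before doing its work.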

  • Hi Thib, thanks for the reply.

    Would there even be a way to modify the Cromwell source ourselves in a branch so that we could override this behavior?
    Do you happen to know when version 2 is scheduled for release?

  • Just FYI, to get around this problem we've resorted to a hack: we send only a single File to a task from the previous task, pull the directory out of that file's path, and then use gsutil cp directly to copy the entire directory to a local folder. Something like this:

    task foo_task {
      File file
    
      command <<<
      # Remove the leading /cromwell_root reference
      FILE=$(echo ${file} | sed -e "s/\/cromwell_root\///g")
      # Obtain the path, which mirrors the path to the files in GCS
      DIR=$(dirname $FILE)
      # Create a directory to hold the files to be copied
      mkdir -p /cromwell_root/stage_folder
      # Copy files from GCS to the local VM
      time gsutil -q -m cp -R gs://$DIR /cromwell_root/stage_folder
      >>>
      ..
    }
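
    The wiring between the tasks then looks roughly like this (a sketch; the workflow name and the upstream task's fastas output are made up for illustration):

    workflow copy_whole_dir {
      call demultiplex_and_sample_prep
      call foo_task {
        # Any one file from the upstream output works; foo_task derives the
        # GCS directory to copy from this file's localized path
        input: file = demultiplex_and_sample_prep.fastas[0]
      }
    }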
    

    Hopefully that helps someone facing the same problem.
    It's way faster to copy the files from GCS in this manner.
