File streaming

dlivitz Member, Broadie
edited February 2016 in Ask the FireCloud Team

I have broken down some of the time that is lost running tasks in FireCloud.

In this example, I am running coverage collection on a single whole-exome BAM, 12 GB in size, from the co-located TCGA bucket.

Two minutes are taken to download the Docker image, which is just over 1 GB.
This is slow, but it isn't too much of a problem.

It then takes eight minutes to copy the inputs.

I've attached the log - it takes 10 minutes to copy all of the resource and input files into the container. In total this is about 15 GB (3 GB reference and 12 GB BAM).

Then the coverage job also takes 10 minutes to run.

In total, half of my runtime is spent copying inputs. The BAM is effectively read twice: once to copy it, and once when the job runs. In effect we are paying for double the compute for tasks that only need to read a BAM once.

If there were an option to pass the files as streams instead of localizing everything before running, it would solve the double-read issue.

Best Answer


  • dlivitz Member, Broadie

    Also because of this behavior, I have to specify dedicated local storage on the container instance, so we are also paying for storage that we technically do not need: the output of this task is a 10 MB file with no intermediates, and it should just be reading the file directly from the co-located bucket.

  • dheiman Member, Broadie ✭✭

    Was any headway ever made on this? GAWB-144 remains unresolved after over a year, and I've had similar issues where mounting/streaming data from an existing bucket would be vastly preferable to copying.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @dheiman, I'm checking with the team to be sure, but my understanding of this is that it's a limitation of the Google platform that we can't currently work around.

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    We continue to actively discuss this, and it's likely that the Google genomics APIs we use will be changing in ways that let us support this (namely the ability to have your credentials on the VM so you can actually stream data). Right now, however, your user does not have credentials on the VM where your tasks are run, so you can't stream from buckets. Supporting this is a major architectural change in both Google and FireCloud.

    Note: Below are some suggestions that are potentially insecure - try them at your own risk! If you try any of these, please be aware that the security of your approach is entirely owned by you. The VMs that are spun up for each call are secure, so if you do anything on those machines, other users will not be able to access those files - but you need to be sure that transfers to and from the VM are done in a secure way, and that private data is secured at rest.

    I think you could work around this for now in two ways:

    • Use presigned URLs for each object you want to access within that VM - we prototyped this and it worked, however you'd need to figure out how to presign them.
    • Store your credentials securely somewhere (e.g. a bucket that only you have access to), localize that file with your task, and use those credentials to stream your data.
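    Both workarounds can be sketched roughly as shell commands. This is only a sketch: the bucket and key-file paths are hypothetical, and `gsutil signurl` requires a service-account private key (and pyopenssl installed) on the machine doing the signing.

```shell
# (1) Pre-sign a URL outside FireCloud, then pass it to the task as a String input.
#     The signed URL grants time-limited HTTPS access without GCS credentials.
gsutil signurl -d 1h /path/to/service-account-key.json gs://my-bucket/sample.bam
# Inside the task, the object can then be streamed over plain HTTPS, e.g.:
#   curl -s "<signed-url>" | samtools view -c -

# (2) Store a service-account key in a private bucket, localize it as a task
#     input, activate it, and stream directly from GCS:
gcloud auth activate-service-account --key-file=key.json
gsutil cat gs://my-bucket/sample.bam | samtools view -c -
```

    Note that option (2) puts long-lived credentials on the VM, so the caveats above about securing data in transit and at rest apply with full force.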
  • bhandsaker Member, Broadie, Moderator admin

    This is also a key issue for us for porting Genome STRiP to FireCloud.
    Currently, when I'm running a job on a FireCloud VM, what are the application default credentials?
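    One way to probe this from inside a running task is to ask the standard tooling directly; whether these commands return anything depends entirely on how the VM was provisioned, so this is just a diagnostic sketch:

```shell
# List any accounts gcloud knows about on this VM.
gcloud auth list

# Try to obtain an application-default access token (fails if no ADC present).
gcloud auth application-default print-access-token \
    || echo "no application default credentials available"

# Ask the GCE metadata server which service account (if any) is attached.
curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
```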

  • bhandsaker Member, Broadie, Moderator admin

    I should perhaps mention that our use case is a bit different. We run multi-sample pipelines that benefit from processing hundreds of samples together. It is impractical to download the data for hundreds of samples to the target VM, but our code is designed to stream efficiently from the source bucket if we can successfully open the files through the gs: URL. We are able to run successfully on GCE instances outside of FireCloud, but we want to be able to run within FireCloud.
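    For htslib-based tools, this kind of direct gs: access can in principle work today given a token: htslib's libcurl support reads gs:// URLs when the `GCS_OAUTH_TOKEN` environment variable is set. A sketch (bucket path hypothetical; assumes samtools was built with libcurl/GCS support and that credentials are available to mint the token):

```shell
# Export an OAuth token so htslib can authenticate its GCS requests.
export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)

# samtools can then open the BAM over the network without localizing it,
# fetching only the header and the index-selected byte ranges it needs.
samtools view -H gs://my-bucket/sample.bam
```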

  • birger Member, Broadie, CGA-mod ✭✭✭

    @abaumann, can you give us a rough estimate (next month? next quarter? next year?) of when file streaming will be available (i.e., when my user credentials will be available within the container for use with streaming)?

    In the interim, we may just proceed with your second suggestion (storing credentials in a bucket)...I'm already using this approach for GDC file retrieval.

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    I can't really say how long file streaming will take - more likely on the order of a year out, since it's a big architectural change and hard to anticipate, but we know we need to do it. The suggestion I gave was based upon your success doing it for GDC, and I think this is a perfectly valid approach to try. You will need to write your tools to be more responsive to transient errors when streaming, but other than that this should work well.
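    A minimal retry-with-backoff wrapper is one way to make a streaming step tolerant of such transient errors. This is a generic sketch, not FireCloud-specific; the gsutil/samtools pipeline in the comment is a hypothetical example of what you might wrap:

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS CMD [ARGS...] - run CMD, retrying with exponential
# backoff (1s, 2s, 4s, ...) until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts=$1; shift
  local delay=1
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Hypothetical usage: restart a streaming read if it drops mid-transfer.
# retry 5 bash -c 'gsutil cat gs://my-bucket/sample.bam | samtools view -c -'
```

    Note this retries the whole stream from the beginning on failure; tools that can resume from an offset (or that retry per-request internally, as gsutil does) will waste less work.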

  • birger Member, Broadie, CGA-mod ✭✭✭

    @dlivitz suggested I use gcsfuse. I'm not sure whether gcsfuse would handle the transient errors. I'll have to read up on the package.
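    For reference, a gcsfuse mount would look roughly like this (bucket name hypothetical; this only works if valid credentials are already visible on the VM, which is the sticking point discussed in this thread):

```shell
# Mount the bucket read-only as a local filesystem; object reads become
# range requests against GCS rather than an up-front copy.
mkdir -p /mnt/my-bucket
gcsfuse --implicit-dirs -o ro my-bucket /mnt/my-bucket

# Tools can then open files by path without localizing them first, e.g.:
#   samtools view -c /mnt/my-bucket/sample.bam

# Unmount when done.
fusermount -u /mnt/my-bucket
```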

  • gordon123 Broad Member, Broadie

    My understanding is that the key sticking point is getting credentials to the VM in a secure way. gcsfuse will not work unless the data is public access, or you solve that in ways like abaumann suggested in this thread on March 20.

  • birger Member, Broadie, CGA-mod ✭✭✭

    Yes, what abaumann suggested is what I have already done for retrieval of controlled access GDC files.
