I have broken down where the time goes when running tasks in FireCloud.
In this example, I am running coverage collection on a single whole-exome BAM, 12 GB in size, from the co-located TCGA bucket.
Two minutes are spent downloading the Docker image, which is just over 1 GB.
This is slow, but it isn't much of a problem.
It then takes eight minutes to copy the inputs.
I've attached the log: it takes 10 minutes to copy all of the resource and input files into the Docker container. In total that is about 15 GB (3 GB reference and 12 GB BAM).
Then the coverage job also takes 10 minutes to run.
In total, about half of my runtime is spent copying the inputs, so the BAM is effectively read twice: once to copy it and once when the job runs. We are paying for double the compute on tasks that only need to read a BAM once.
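As a rough tally of the figures above, here is a small Python sketch. It assumes the 10-minute copy time from the log, and the stage names are just labels I made up for illustration.

```python
# Minutes per stage, taken from the post (copy time uses the 10-minute log figure).
stages = {
    "docker_pull": 2,       # download the ~1 GB image
    "localize_inputs": 10,  # copy ~15 GB of reference + BAM into the container
    "coverage_job": 10,     # the actual coverage collection
}

total_minutes = sum(stages.values())

# Share of wall-clock time spent in each stage.
shares = {name: minutes / total_minutes for name, minutes in stages.items()}

for name, minutes in stages.items():
    print(f"{name}: {minutes} min ({shares[name]:.0%} of {total_minutes} min)")
```

With these numbers, localization alone accounts for a little under half of the total runtime.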
If there were an option to pass the files as streams instead of localizing everything before running, it would solve the double-read issue.
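To illustrate the double read, here is a minimal Python sketch that counts bytes moved under each approach. It does not use any FireCloud or cloud storage API: an in-memory `io.BytesIO` stands in for the bucket object, and `process`, `localize_then_run`, and `stream_and_run` are hypothetical names for illustration only.

```python
import io

CHUNK = 64 * 1024  # read size in bytes; arbitrary choice for the sketch


def process(stream):
    """Stand-in for the coverage tool: read the stream once, return bytes consumed."""
    seen = 0
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        seen += len(chunk)
    return seen


def localize_then_run(remote):
    """Copy the whole input to a local buffer first, then process the copy."""
    local = io.BytesIO()
    copied = 0
    while True:
        chunk = remote.read(CHUNK)
        if not chunk:
            break
        local.write(chunk)
        copied += len(chunk)
    local.seek(0)
    processed = process(local)
    return copied + processed  # every byte is handled twice


def stream_and_run(remote):
    """Process the input directly as it is read, with no intermediate copy."""
    return process(remote)


payload = bytes(1_000_000)  # fake 1 MB "BAM" made of zero bytes
two_pass = localize_then_run(io.BytesIO(payload))
one_pass = stream_and_run(io.BytesIO(payload))
print(two_pass, one_pass)
```

This only models I/O volume. Whether streaming actually saves wall-clock time would depend on whether the tool can read sequentially and on the bucket's throughput.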