[WDL][Cromwell] Mounting a directory to the docker for access.


I am attempting to run Gemini inside a Docker container through WDL and Cromwell. I installed gemini with no data, since the data is too large to bake into a Docker image (and doing so is bad practice anyway). So I need to download the data elsewhere and link it so that the gemini binary can access it. Locally, on my own machine without WDL, I might run the following to get this to work:

docker run --rm -v /path/to/local/gemini/data:/path/to/container/gemini/data -i gemini load -t VEP -v my.vcf my.db

At the bottom I have outlined my submission script for google genomics pipelines run, along with the yaml configuration, for background. However, the crux of my problem is that I am unsure what the mount procedure is for the docker in the Broad wdl_runner docker image.

In the WDL documentation, for local backends, docker is by default invoked as follows:

docker run --rm -v <cwd>:<docker_cwd> -i <docker_image> /bin/bash < <script>
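On local-style backends, the line above is not hard-coded; it comes from the backend configuration's `submit-docker` string, so an extra `-v` mount can be spliced in there. A sketch of such a stanza (the gemini data path is a placeholder, and note this hook exists only for local-style backends, not for JES):

```hocon
backend {
  providers {
    Local {
      config {
        # Default local-backend docker invocation from Cromwell's
        # reference configuration, with one extra -v mount added for
        # the gemini annotation data (that path is a placeholder).
        submit-docker = """
          docker run \
            --rm -i \
            -v ${cwd}:${docker_cwd} \
            -v /path/to/local/gemini/data:/usr/bin_dir/gemini/data \
            ${docker} /bin/bash < ${script}
        """
      }
    }
  }
}
```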

Now suppose I have my data in a google bucket at gs://my_bucket/data_for_gemini. How would I define, in WDL, the appropriate code to mount that google bucket directory so gemini inside the docker can access it?

Example WDL:

task Gemini {
    File my_vcf
    # how to pass an entire google bucket directory as a target site?

    command {
        # define mounts in here somehow?
        gemini load -t VEP -v ${my_vcf} out.db
    }
    runtime {
        # define mounts in here?
        docker: "gcr.io/my_containers/gemini"
        memory: "4 GB"
        cpu: "1"
    }
    output {
        File gemini_db = "out.db"
    }
}

I have thought of one inelegant solution: running docker-in-docker and mounting that way. But I wanted to know if there is a better, more elegant approach.
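Short of true mounting, one pattern that fits WDL's file-localization model is to tar the data directory up in the bucket and pass the tarball as an ordinary File input, unpacking it in the command. A hedged sketch reusing the task above (the tarball name is hypothetical, and gemini would still need its annotation directory pointed at the unpacked path):

```wdl
task Gemini {
    File my_vcf
    File gemini_data_tar   # e.g. a tarball of gs://my_bucket/data_for_gemini

    command {
        # unpack the localized annotation data next to the job
        mkdir -p gemini_data
        tar -xzf ${gemini_data_tar} -C gemini_data
        gemini load -t VEP -v ${my_vcf} out.db
    }
    runtime {
        docker: "gcr.io/my_containers/gemini"
        memory: "4 GB"
        cpu: "1"
    }
    output {
        File gemini_db = "out.db"
    }
}
```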

-- Derrick DeConti

My submission script is:

gcloud alpha genomics pipelines run \
        --pipeline-file wdl_pipeline.yaml \
        --zones us-east1-b \
        --logging gs://dfci-cccb-pipeline-testing/logging \
        --inputs-from-file WDL=VariantCalling.cloud.wdl  \
        --inputs-from-file WORKFLOW_INPUTS=VariantCalling.cloud.inputs.json \
        --inputs-from-file WORKFLOW_OPTIONS=VariantCalling.cloud.options.json \
        --inputs WORKSPACE=gs://dfci-cccb-pipeline-testing/workspace \
        --inputs OUTPUTS=gs://dfci-cccb-pipeline-testing/outputs

The resultant yaml is as follows:

name: WDL Runner
description: Run a workflow defined by a WDL file

inputParameters:
- name: WDL
  description: Workflow definition
- name: WORKFLOW_INPUTS
  description: Workflow inputs
- name: WORKFLOW_OPTIONS
  description: Workflow options
- name: WORKSPACE
  description: Cloud Storage path for intermediate files
- name: OUTPUTS
  description: Cloud Storage path for output files

docker:
  imageName: gcr.io/broad-dsde-outreach/wdl_runner

  cmd: >

resources:
  minimumRamGb: 1

Issue · Github
by Geraldine_VdAuwera


Best Answer


  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
    Hi Derrick, I'm not familiar with Gemini. Can you clarify how the program normally expects to access this data, and whether there is a mechanism in place to set an alternate location?
  • deconti · DFCI · Member

    Hi, Geraldine.

    Gemini is pointed at the data via an option flag on the command line during installation of the software; the flag points to a directory. As it's considered bad practice to load all that data into a Docker image, the suggestion is to create a docker volume instead. So, what I would do on my local machine is install gemini:

    # Installs gemini with no data, with libraries and binaries for use in /usr/bin_dir/gemini
    python /usr/bin_dir/gemini-0.19.1/gemini/scripts/gemini_install.py --nodata  /usr/bin_dir/gemini /usr/bin_dir/gemini

    The next step is to then load the data:

    # Download necessary data sources to /usr/bin_dir/gemini/data
    # /data is implied when pointing to the directory you want the data.
    /usr/bin_dir/gemini-0.19.1/gemini/anaconda/bin/python /usr/bin_dir/gemini-0.19.1/gemini-0.19.1/gemini/install-data.py /usr/bin_dir/gemini
    # Then load some more data into the data directory
    gemini update --dataonly --extra cadd_score
    gemini update --dataonly --extra gerp_bp

    From here, following suggested docker practices, I would instead download the data locally (i.e. outside docker), mount it into the container at /usr/bin_dir/gemini/data, and change gemini's yaml configuration file to point the annotation directory at this mounted location (/usr/bin_dir/gemini/data).

    # mounts my local directory to the docker directory,
    # then runs gemini to load a vcf
    docker run -t -v /my/local/gemini/data:/usr/bin_dir/gemini/data -i gemini gemini load -t VEP -v my_vcf.vcf my_gemini_db.db

    I hope that helps explain it.
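The config edit described above can be scripted; a hedged sketch (the gemini-config.yaml layout, the annotation_dir key, and both paths are assumptions used for illustration — check your install's actual config file):

```shell
# Sketch: repoint gemini's annotation directory at the mounted data
# location. The config file contents and the annotation_dir key are
# assumptions; the "default" path below is hypothetical.
CONFIG=$(mktemp)
cat > "$CONFIG" <<'EOF'
annotation_dir: /usr/bin_dir/gemini-0.19.1/data
EOF

# The path the docker volume is mounted at inside the container.
MOUNTED_DATA=/usr/bin_dir/gemini/data
sed -i "s|^annotation_dir:.*|annotation_dir: $MOUNTED_DATA|" "$CONFIG"
grep "^annotation_dir:" "$CONFIG"
```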


  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @deconti, thanks for the explanation. I'm not familiar with this enough myself but I'll ask one of our engineers to help propose a solution.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @deconti @ChrisL We'll put it in the Cromwell ticket

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
  • deconti · DFCI · Member

    Thanks, ChrisL.

    That answers my question. (I also did not realize I could not perform docker in docker via JES.)

    I'll either try getting Docker to incorporate all the downloaded data into one container, or I'll just have to run Gemini outside of Cromwell and WDL.

    Thanks, again.
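The "everything in one container" route mentioned above could look roughly like this Dockerfile sketch (the installer URL and install paths are assumptions based on gemini's published install procedure, and the resulting image will be very large — that is the trade-off being accepted here):

```dockerfile
# Sketch: bake gemini AND its annotation data into one image.
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y python wget \
 && rm -rf /var/lib/apt/lists/*

# Fetch and run the gemini installer WITHOUT --nodata, so the
# annotation data lands inside the image (paths are placeholders).
RUN wget -q https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py \
 && python gemini_install.py /usr/local /usr/local/share/gemini

# Pull the optional extra annotation tracks into the image as well.
RUN gemini update --dataonly --extra cadd_score \
 && gemini update --dataonly --extra gerp_bp
```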

  • g3n3 · Member

    Hello GATK team, I had a similar issue where I was hoping I could use a tmpfs-type mount. I posted a note on the related github issue as well.

    Mainly, we are hoping the docker can be launched with something like this (from https://docs.docker.com/storage/tmpfs/#limitations-of-tmpfs-containers):

    $ docker run -d \
      -it \
      --name tmptest \
      --mount type=tmpfs,destination=/app \
      nginx:latest

    Where we can mount a tmpfs volume and declare its mount point on google cloud.

    We currently do this in our local cromwell runs by adding our own runtime parameter to the docker run submit line:

    ${'--mount type=tmpfs,destination='+mount_tmpfs}

    This lets us use a ramdisk to unpack tens of thousands of files in seconds. I don't see a straightforward way to add this for the google cloud submit, so if it could be supported as part of this feature request, that would be great!
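The runtime-parameter interpolation above just splices an extra flag into the docker run line; a minimal shell sketch of that assembly (the image name and mount point are placeholders):

```shell
# Assemble a docker run command with an optional tmpfs mount, mirroring
# what the ${'--mount type=tmpfs,destination='+mount_tmpfs} runtime
# interpolation produces. MOUNT_TMPFS and IMAGE are placeholders.
MOUNT_TMPFS=/ramdisk
IMAGE=gcr.io/my_containers/gemini

EXTRA_ARGS=""
if [ -n "$MOUNT_TMPFS" ]; then
    EXTRA_ARGS="--mount type=tmpfs,destination=$MOUNT_TMPFS"
fi

CMD="docker run --rm -i $EXTRA_ARGS $IMAGE"
echo "$CMD"
```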


  • Sheila · Broad Institute · Member, Broadie, Moderator

    Hi Jason,

    Sorry for the delay. I was away at a workshop. Let me ask someone from the team to answer you soon.


  • ChrisL · Cambridge, MA · Member, Broadie, Moderator, Dev

    I'd suggest waiting for the WDL Directory type to be added, at which point whether the path gets localized via a direct "copy everything onto the VM" or a less heavyweight "mount path as a volume" could be a configuration/runtime/customization option.

    Warning! What I said above holds unless you're expecting this mount to be read/write, which I would say is probably not coming. The WDL language and Cromwell engine are pretty heavily based on the assumption that the values they move around are immutable, so altering Directories in place as part of a command would probably cause more problems than it solves...
