
[WDL][Cromwell] Mounting a directory to the docker for access


I am attempting to run Gemini within a docker through WDL and Cromwell. I have installed gemini with no data, as the data is too large to be put into a Docker image (plus it's bad practice). So I need to download the data elsewhere and link it so that it's available for the gemini binary to access. Locally on my own machine, without WDL, I might run the following to get this to work:

docker run --rm -v /path/to/local/gemini/data:/path/to/container/gemini/data -i gemini gemini load -t VEP -v my.vcf my.db

At the bottom I have outlined my submission script for google genomics pipelines run, along with the yaml configuration, for background. However, the crux of my problem is that I am unsure what the mount procedure for the docker is with the Broad docker image for wdl_runner.

In the WDL documentation, for local backends, the docker by default does the following:

docker run --rm -v <cwd>:<docker_cwd> -i <docker_image> /bin/bash < <script>
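
As I understand it, that invocation is configurable on shared-filesystem (local) backends via the submit-docker string in the Cromwell backend configuration, so locally one could presumably tack an extra -v mount onto it, roughly like this (the gemini data path is my own placeholder, I haven't verified the substitution names against my Cromwell version, and this wouldn't help on JES anyway):

backend {
  providers {
    Local {
      config {
        # Same as the default invocation above, plus one extra -v mount
        # for the gemini annotation data (host path is a placeholder).
        submit-docker = """
          docker run --rm -i \
            -v ${cwd}:${docker_cwd} \
            -v /path/to/local/gemini/data:/usr/bin_dir/gemini/data \
            ${docker} /bin/bash ${docker_script}
        """
      }
    }
  }
}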

Now suppose I have my data in a google bucket at gs://my_bucket/data_for_gemini. How would I write the WDL to mount that google bucket directory so that gemini inside the docker can access it?

Example WDL:

task Gemini {
    File my_vcf
    # how to pass an entire google bucket directory as a target site?

    command {
        # define mounts in here somehow?
        gemini load -t VEP -v ${my_vcf} out.db
    }
    runtime {
        # define mounts in here?
        docker: ""
        memory: "4 GB"
        cpu: "1"
    }
    output {
        File gemini_db = "out.db"
    }
}

One inelegant solution I have considered is to run a docker within a docker and mount through that. But I wanted to know if there is a better, more elegant way.
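
Alternatively, I suppose I could sidestep mounting altogether and localize the bucket contents inside the command block, assuming gsutil is available in the gemini image (the bucket path, image name, and data directory below are placeholders):

task Gemini {
    File my_vcf
    # Placeholder input, e.g. "gs://my_bucket/data_for_gemini"
    String gemini_data_dir

    command {
        # Copy the annotation data out of the bucket into the task's
        # working directory; assumes gsutil exists inside the image.
        mkdir -p gemini_data
        gsutil -m cp -r ${gemini_data_dir}/* gemini_data/
        # gemini's config would still need its annotation directory
        # pointed at gemini_data for this to work.
        gemini load -t VEP -v ${my_vcf} out.db
    }
    runtime {
        docker: "gemini"
        memory: "4 GB"
        cpu: "1"
    }
    output {
        File gemini_db = "out.db"
    }
}

That seems wasteful for large annotation sets, though, since the data would be re-copied on every task invocation.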

-- Derrick DeConti

My submission script is:

gcloud alpha genomics pipelines run \
        --pipeline-file wdl_pipeline.yaml \
        --zones us-east1-b \
        --logging gs://dfci-cccb-pipeline-testing/logging \
        --inputs-from-file  \
        --inputs-from-file \
        --inputs-from-file \
        --inputs WORKSPACE=gs://dfci-cccb-pipeline-testing/workspace \
        --inputs OUTPUTS=gs://dfci-cccb-pipeline-testing/outputs

The resultant yaml is as follows:

name: WDL Runner
description: Run a workflow defined by a WDL file

inputParameters:
- name: WDL
  description: Workflow definition
- name: WORKFLOW_INPUTS
  description: Workflow inputs
- name: WORKFLOW_OPTIONS
  description: Workflow options
- name: WORKSPACE
  description: Cloud Storage path for intermediate files
- name: OUTPUTS
  description: Cloud Storage path for output files

docker:
  cmd: >

resources:
  minimumRamGb: 1


Best Answer


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    Hi Derrick, I'm not familiar with Gemini. Can you clarify how the program normally expects to access this data, and whether there is a mechanism in place to set an alternate location?
  • deconti (DFCI; Member)

    Hi, Geraldine.

    Gemini accesses the data via a command-line option flag during installation of the software; the flag points to a directory. As it's considered bad practice to load all that data into a Docker image, the suggested approach is to create a docker volume. So, what I would do on my local machine is install gemini:

    # Installs gemini with no data, with libraries and binaries for use in /usr/bin_dir/gemini
    python /usr/bin_dir/gemini-0.19.1/gemini/scripts/ --nodata  /usr/bin_dir/gemini /usr/bin_dir/gemini

    The next step is to then load the data:

    # Download necessary data sources to /usr/bin_dir/gemini/data
    # a /data subdirectory is implied when pointing to the directory where you want the data.
    /usr/bin_dir/gemini-0.19.1/gemini/anaconda/bin/python /usr/bin_dir/gemini-0.19.1/gemini-0.19.1/gemini/ /usr/bin_dir/gemini
    # Then load some more data into the data directory
    gemini update --dataonly --extra cadd_score
    gemini update --dataonly --extra gerp_bp

    From here, if I'm following suggested docker practices, I would instead download the data locally to my computer (i.e. outside docker), mount it to /usr/bin_dir/gemini/data, and change gemini's yaml configuration file to point the annotation directory to this mounted location (/usr/bin_dir/gemini/data).

    # mounts my local directory to the docker directory.
    # runs gemini to load a vcf
    docker run -t -v /my/local/gemini/data:/usr/bin_dir/gemini/data -i gemini gemini load -t VEP -v my_vcf.vcf my_gemini_db.db
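
    The config change I mean is just pointing the annotation directory at the mounted path in gemini's config file (the annotation_dir key is how I recall gemini's configuration working; adjust to your install):

    # gemini-config.yaml
    annotation_dir: /usr/bin_dir/gemini/data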

    I hope that helps explain it.


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @deconti, thanks for the explanation. I'm not familiar with this enough myself but I'll ask one of our engineers to help propose a solution.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @deconti @ChrisL We'll put it in the Cromwell ticket.

  • deconti (DFCI; Member)

    Thanks, ChrisL.

    That answers my question. (I also did not realize I could not perform docker in docker via JES.)

    I'll either try building a Docker image that incorporates all the downloaded data into one container (roughly as sketched below), or I'll just have to run Gemini outside of Cromwell and WDL.
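
    For the all-in-one image, I'm picturing something like this Dockerfile (untested; the base image name is a placeholder for the no-data gemini image described above):

    # Start from the gemini image that was installed with --nodata.
    FROM gemini
    # Bake the annotation data into the image at build time, so nothing
    # needs to be mounted at runtime. This makes for a very large image.
    # (The main data download step from my earlier comment goes here too.)
    RUN gemini update --dataonly --extra cadd_score && \
        gemini update --dataonly --extra gerp_bp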

    Thanks again.
