[WDL][Cromwell] Mounting a directory into the Docker container for access.

Hi,

I am attempting to run Gemini inside a Docker container through WDL and Cromwell. I have installed Gemini without its data, as the data is too large to be put into a Docker image (plus it's bad practice). So I need to download the data elsewhere and link it so it is available for the gemini binary to access. Locally, on my own machine without WDL, I might run the following to get this to work:

docker run --rm -v /path/to/local/gemini/data:/path/to/container/gemini/data -i gemini gemini load -t VEP -v my.vcf my.db

At the bottom I have outlined my submission script (using gcloud alpha genomics pipelines run) and the YAML configuration for background. However, the crux of my problem is that I am unsure what the mount procedure is for the Broad wdl_runner Docker image.

In the WDL documentation, for local backends, Docker is by default invoked as follows:

docker run --rm -v <cwd>:<docker_cwd> -i <docker_image> /bin/bash < <script>

Now suppose I have my data in a Google bucket at gs://my_bucket/data_for_gemini. How would I write the WDL so that this bucket directory is mounted and gemini inside the container can access it?

Example WDL:

task Gemini {
    File my_vcf
    # how to pass an entire google bucket directory as a target site?

    command {
        # define mounts in here somehow?
        gemini load -t VEP -v ${my_vcf} out.db
    }
    runtime {
        # define mounts in here?
        docker: "gcr.io/my_containers/gemini"
        memory: "4 GB"
        cpu: "1"
    }
    output {
        File gemini_db = "out.db"
    }
}

I have thought of one inelegant solution: run Docker-in-Docker and mount the volume that way. But I wanted to know if there is a better, more elegant approach.
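For reference, Cromwell's JES backend localizes every declared File input given as a gs:// path onto the VM before the container starts. So one fallback I can picture, sketched below with hypothetical names and paths, is to stage the annotation data as a single tarball in the bucket, declare it as a File, and unpack it in the command:

task GeminiWithTarball {
    File my_vcf
    # Hypothetical tarball of the gemini annotation data, staged in GCS;
    # Cromwell localizes File inputs given as gs:// paths automatically.
    File gemini_data_tar

    command {
        mkdir -p gemini_data
        tar xzf ${gemini_data_tar} -C gemini_data
        # gemini would still need its annotation directory pointed at
        # gemini_data (via its config) before this call
        gemini load -t VEP -v ${my_vcf} out.db
    }
    runtime {
        docker: "gcr.io/my_containers/gemini"
        memory: "4 GB"
        cpu: "1"
    }
    output {
        File gemini_db = "out.db"
    }
}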

-- Derrick DeConti

My submission script is:

gcloud alpha genomics pipelines run \
        --pipeline-file wdl_pipeline.yaml \
        --zones us-east1-b \
        --logging gs://dfci-cccb-pipeline-testing/logging \
        --inputs-from-file WDL=VariantCalling.cloud.wdl  \
        --inputs-from-file WORKFLOW_INPUTS=VariantCalling.cloud.inputs.json \
        --inputs-from-file WORKFLOW_OPTIONS=VariantCalling.cloud.options.json \
        --inputs WORKSPACE=gs://dfci-cccb-pipeline-testing/workspace \
        --inputs OUTPUTS=gs://dfci-cccb-pipeline-testing/outputs

The wdl_pipeline.yaml is as follows:

name: WDL Runner
description: Run a workflow defined by a WDL file

inputParameters:
- name: WDL
  description: Workflow definition
- name: WORKFLOW_INPUTS
  description: Workflow inputs
- name: WORKFLOW_OPTIONS
  description: Workflow options

- name: WORKSPACE
  description: Cloud Storage path for intermediate files
- name: OUTPUTS
  description: Cloud Storage path for output files

docker:
  imageName: gcr.io/broad-dsde-outreach/wdl_runner

  cmd: >
    /wdl_runner/wdl_runner.sh

resources:
  minimumRamGb: 1

Issue · GitHub
by Geraldine_VdAuwera

Issue Number: 1989
State: closed
Closed By: vdauwera

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
    Hi Derrick, I'm not familiar with Gemini. Can you clarify how the program normally expects to access this data, and whether there is a mechanism in place to set an alternate location?
  • deconti · DFCI · Member

    Hi, Geraldine.

    Gemini is pointed at the data via an option flag on the command line during installation of the software; the flag points to a directory. As it's considered bad practice to load all that data into a Docker image, the suggestion is to create a Docker volume instead. So, on my local machine I would first install gemini:

    # Installs gemini with no data, with libraries and binaries for use in /usr/bin_dir/gemini
    python /usr/bin_dir/gemini-0.19.1/gemini/scripts/gemini_install.py --nodata /usr/bin_dir/gemini /usr/bin_dir/gemini
    

    The next step is to then load the data:

    # Download the necessary data sources to /usr/bin_dir/gemini/data
    # (/data is implied when pointing at the directory where you want the data)
    /usr/bin_dir/gemini-0.19.1/gemini/anaconda/bin/python /usr/bin_dir/gemini-0.19.1/gemini/install-data.py /usr/bin_dir/gemini
    # Then load some more data into the data directory
    gemini update --dataonly --extra cadd_score
    gemini update --dataonly --extra gerp_bp
    

    From here, if I'm following suggested Docker practices, I would instead download the data locally (i.e. outside the container), mount it into the container at /usr/bin_dir/gemini/data, and change gemini's YAML configuration file to point the annotation directory at that mounted location (/usr/bin_dir/gemini/data).
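    To make that concrete, it's a one-line edit in gemini's YAML config (the exact file name and location vary by install; annotation_dir is the key gemini reads for its annotation directory, if I recall correctly):

    # in gemini's YAML config file; point the annotation directory
    # at the mounted volume instead of the install-time default
    annotation_dir: /usr/bin_dir/gemini/data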

    # mounts my local directory onto the docker directory
    # runs gemini to load a vcf
    docker run -t -v /my/local/gemini/data:/usr/bin_dir/gemini/data -i gemini gemini load -t VEP -v my_vcf.vcf my_gemini_db.db
    

    I hope that helps explain it.

    Thanks,
    Derrick

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @deconti, thanks for the explanation. I'm not familiar enough with this myself, but I'll ask one of our engineers to help propose a solution.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @deconti @ChrisL We'll put this in the Cromwell ticket.

  • deconti · DFCI · Member

    Thanks, ChrisL.

    That answers my question. (I also did not realize that Docker-in-Docker is not possible via JES.)

    I'll either try getting Docker to incorporate all the downloaded data into one container, or I'll just have to run Gemini outside of Cromwell and WDL.
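    For the first option, a minimal Dockerfile sketch of baking the data into the image could look like this (base image name assumed from my earlier example, paths taken from the install steps above; the resulting image will be very large):

    FROM gcr.io/my_containers/gemini
    # download the annotation data into the image at build time
    RUN /usr/bin_dir/gemini-0.19.1/gemini/anaconda/bin/python \
            /usr/bin_dir/gemini-0.19.1/gemini/install-data.py /usr/bin_dir/gemini \
     && gemini update --dataonly --extra cadd_score \
     && gemini update --dataonly --extra gerp_bp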

    Thanks, again.

  • g3n3 · Member

    Hello GATK team, I had a similar issue where I was hoping to use a tmpfs-type mount. I posted a note on the related GitHub issue as well:

    https://github.com/broadinstitute/cromwell/issues/2190

    Mainly, we are hoping that the container can be launched with something like the following (from https://docs.docker.com/storage/tmpfs/#limitations-of-tmpfs-containers):

    $ docker run -d \
      -it \
      --name tmptest \
      --mount type=tmpfs,destination=/app \
      nginx:latest
    

    This would let us mount a tmpfs volume and declare its mount point on Google Cloud.

    We currently do this in our local Cromwell runs by giving the submit-docker command our own runtime parameter:

    ${'--mount type=tmpfs,destination='+mount_tmpfs}

    This lets us use a ramdisk to unpack tens of thousands of files in seconds. I don't see a straightforward way to add this for the Google Cloud submit, so it would be great if it could be supported as part of this feature request!
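    For context, this is roughly how we wire that parameter into our local backend stanza in cromwell.conf (a simplified HOCON sketch; mount_tmpfs is our own custom runtime attribute, not a built-in):

    runtime-attributes = """
      String? docker
      String? mount_tmpfs
    """
    submit-docker = """
      docker run --rm -i \
        ${'--mount type=tmpfs,destination=' + mount_tmpfs} \
        --entrypoint ${job_shell} \
        -v ${cwd}:${docker_cwd} \
        ${docker} ${docker_script}
    """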

    Thanks,
    Jason

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @g3n3
    Hi Jason,

    Sorry for the delay; I was away at a workshop. Let me ask someone from the team to answer you soon.

    -Sheila

  • ChrisL · Cambridge, MA · Member, Broadie, Moderator, Dev

    I'd suggest waiting for the WDL Directory type to be added, at which point whether the path gets localized via a direct "copy everything onto the VM" or a less heavyweight "mount path as a volume" could be a configuration/runtime/customization option.

    Warning! What I said above is true unless you're expecting this mount to be read/write, which I would say will probably not be coming. The WDL language and Cromwell engine are pretty heavily based on the assumption that the values they move around are immutable, so altering Directories in place as part of a command would probably cause more problems than it solves...
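    To illustrate (purely speculative; the syntax is not final), a task using such a Directory type might eventually look like this, with the engine deciding whether the directory is copied onto the VM or mounted as a volume:

    task Gemini {
        File my_vcf
        Directory gemini_data    # localized by the engine: copy or mount

        command {
            gemini load -t VEP -v ${my_vcf} out.db
        }
        runtime {
            docker: "gcr.io/my_containers/gemini"
        }
        output {
            File gemini_db = "out.db"
        }
    }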
