Accessing local files on the host machine while running a WDL/Cromwell script / Mounting a volume automatically

Yatros, Seattle, WA, USA (Member)

Hello,

I'm testing a WDL/Cromwell script locally with the GATK Docker image to combine several GVCF files with GenomicsDBImport before moving it to the cloud. I need to access files on the host machine while the script runs, but I can't work out how to mount the volume containing them so that they are visible inside the Docker container.

This is the part of my script (a modified version of the GATK joint-discovery-gatk4.wdl script) I'm stuck at:

  scatter (idx in range(length(unpadded_intervals))) {
    # the batch_size value was carefully chosen here as it
    # is the optimal value for the amount of memory allocated
    # within the task; please do not change it without consulting
    # the Hellbender (GATK engine) team!
    call ImportGVCFs {
      input:
        sample_name_map = sample_name_map,
        interval = unpadded_intervals[idx],
        workspace_dir_name = "genomicsdb",
        disk_size = medium_disk,
        docker_image = gatk_docker,
        batch_size = 50,
        gatk_path = gatk_path
    }
...
}

task ImportGVCFs {
  File sample_name_map
  String interval
  String workspace_dir_name
  String java_opt
  String docker_image
  String gatk_path
  Int disk_size
  String mem_size
  Int preemptibles
  Int batch_size

  command <<<
    set -e

    rm -rf ${workspace_dir_name}

    # The memory setting here is very important and must be several GB lower
    # than the total memory allocated to the VM because this tool uses
    # a significant amount of non-heap memory for native libraries.
    # Also, testing has shown that the multithreaded reader initialization
    # does not scale well beyond 5 threads, so don't increase beyond that.
    ${gatk_path} --java-options "${java_opt}" \
    GenomicsDBImport \
    --genomicsdb-workspace-path ${workspace_dir_name} \
    --batch-size ${batch_size} \
    -L ${interval} \
    --sample-name-map ${sample_name_map} \
    --reader-threads 5

    tar -cf ${workspace_dir_name}.tar ${workspace_dir_name}

  >>>
  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "2"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
  output {
    File output_genomicsdb = "${workspace_dir_name}.tar"
  }
}

My JSON file looks like this:

{  
  "##_COMMENT1": "INPUT GVCFs & COHORT -- DATASET-SPECIFIC, MUST BE ADAPTED",
  "JointGenotyping.callset_name": "TEST",
  "JointGenotyping.sample_name_map": "/mnt/user/TEST/GVCFS/Samples.sample_map",
....
  "##_COMMENT4": "DOCKERS", 
  "JointGenotyping.python_docker": "python:2.7",
  "JointGenotyping.gatk_docker": "broadinstitute/gatk:4.0.1.0",
  "JointGenotyping.gatk_path": "/gatk/gatk"
}

My sample_map looks like this:

    S1  /mnt/user/TEST/GVCFS/S1.vcf.gz
    S2  /mnt/user/TEST/GVCFS/S2.vcf.gz
    S3  /mnt/user/TEST/GVCFS/S3.vcf.gz
    S4  /mnt/user/TEST/GVCFS/S4.vcf.gz
    S5  /mnt/user/TEST/GVCFS/S5.vcf.gz

I know how to mount a volume in a single Docker container:

docker run -v /mnt/user/TEST/GVCFS/:/mnt/mydata -it broadinstitute/gatk:4.0.1.0

The problem is how to do this automatically when running a WDL/Cromwell script.
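One detail about the `docker run` command above: it exposes the GVCFs at /mnt/mydata inside the container, while the sample map lists them under /mnt/user/TEST/GVCFS, so the paths in the map would not resolve there. Mounting each host directory at the identical path inside the container avoids rewriting the map. The following is a minimal sketch of that idea, assuming a two-column, whitespace-separated sample map and paths without spaces; `mount_flags_from_map` is a hypothetical helper name, not part of GATK or Cromwell, and it does not by itself make Cromwell add the mount.

```shell
#!/usr/bin/env bash
# Sketch: derive read-only bind-mount flags from a sample map so that each
# host directory appears at the same path inside the container.
# mount_flags_from_map is a hypothetical helper (an assumption for illustration).
mount_flags_from_map() {
  local sample_map="$1"
  # Column 2 holds the GVCF path. Strip the file name, de-duplicate the
  # directories, and print one "-v dir:dir:ro" flag per directory.
  awk 'NF >= 2 { print $2 }' "$sample_map" \
    | sed 's|/[^/]*$||' \
    | sort -u \
    | while IFS= read -r dir; do
        printf -- '-v %s:%s:ro\n' "$dir" "$dir"
      done
}

# Possible usage (left unquoted on purpose so the flags split into words):
#   docker run $(mount_flags_from_map Samples.sample_map) -it broadinstitute/gatk:4.0.1.0
```

For the sample map shown earlier this would print a single flag, since all five GVCFs live in one directory.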

I already saw this post from last year, but I haven't found a solution for it:

https://github.com/broadinstitute/cromwell/issues/2190

I don't want to copy local files to the container every time I want to check an updated pipeline. Is there an easy way to address this issue?

Thanks,

Yatros

Answers

  • Disclaimer first: I'm a WDL/Cromwell newb.

    When I've used a docker image in a task, Cromwell automatically mounts the task's directory. Here's some (somewhat cleaned up) Cromwell stdout from a task call that shows this:

    executing: docker run \
      --cidfile docker_cid \
      --rm -i \
      --entrypoint /bin/bash \
      -v cromwell-executions/bam2readcount_batch/102a37c4-1638-4d43-90c1-aebad1ccec44/call-bam2readcount/shard-0:/cromwell-executions/bam2readcount_batch/102a37c4-1638-4d43-90c1-aebad1ccec44/call-bam2readcount/shard-0 \
      my_image /cromwell-executions/bam2readcount_batch/102a37c4-1638-4d43-90c1-aebad1ccec44/call-bam2readcount/shard-0/execution/script
    

    So, if you can get your files into the task's input directory, Cromwell will mount them automatically. My solution is to declare the files as inputs to the task. (This appears to make Cromwell copy the files, which seems problematic for large files...)
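
    Whether an input really gets copied may depend on how the backend is configured; to my understanding, Cromwell's local backend can also hard-link inputs when the filesystem allows it. One way to check is to compare the original file with its localized counterpart under the task's directory. A minimal sketch, assuming both paths are supplied by hand; `same_file` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: report whether a localized input shares storage with the original.
# same_file is a hypothetical helper; both paths are supplied by the caller.
same_file() {
  local original="$1" localized="$2"
  # -ef is true when both paths refer to the same device and inode,
  # which is the case for a hard link. A missing file also reports "copied".
  if [ "$original" -ef "$localized" ]; then
    echo "linked"
  else
    echo "copied"
  fi
}

# Possible usage, with the second path taken from your own run directory:
#   same_file /mnt/user/TEST/GVCFS/S1.vcf.gz <path to the localized copy>
```

    If it prints "linked", the large files are not being duplicated on disk.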
