"Could not capture docker logs" error

RLCollins (Harvard Medical School, Member)
edited July 2017 in Ask the FireCloud Team

Hi,

I'm encountering an error I can't explain while testing a new method (MELT; see http://melt.igs.umaryland.edu/) on a single ~30X human whole-genome sample in FireCloud.

The error message returned in the Monitor tab is:
"message: Task MELTWorkflow.MELT:NA:1 failed. JES error code 10. Message: 15: Gsutil failed: Could not capture docker logs: failed to acquire logs: exit status 1"

The stdout and stderr logs for this job look completely normal, except that the job appears to have stopped abruptly partway through, after running for >1 hour, without any obvious errors.

I wasn't able to find any leads for this error on Google or by searching here in the FireCloud forum.

The error message mentions Docker logs, but I've confirmed that the Docker image I deploy in the WDL runs successfully on my local machine, so I can't spot anything obviously wrong there either.

I've previously run this script on thousands of human WGS libraries on both the Broad and Partners Healthcare clusters without problems.

For reference, the Docker is on Dockerhub here:
https://hub.docker.com/r/rlcollins/melt/

I've also attached both the WDL and a script called from within the WDL. (Note: I had to upload both with a .txt extension as the forum kicked back errors when trying to upload files with .sh or .wdl extensions)

Please let me know if there's a clear solution to this error, or if there's any further information I can provide to help troubleshoot this bug. I'd appreciate any pointers or tips you can provide!

Thank you,

Ryan Collins
PhD Candidate, Talkowski Lab
MGH / Broad / HMS
[email protected]

Answers

  • RLCollins (Harvard Medical School, Member)

    Hi,

    Adding a brief update here:

    1) I've confirmed that this error is reproducible when running the same workflow multiple times on the same sample, and that the same error is returned when it's run on different samples (all in the same workspace).

    2) I noticed it's returning a JES error code of 10, for which the only Google hit was Eric's post in this other issue posted to the FC forum: https://gatkforums.broadinstitute.org/firecloud/discussion/9851/cryptic-failure-messages. Could something similar be going on here, or are the errors totally separate?

    3) I've appended an example JES log to the bottom of this post.

    Thanks again,
    Ryan

    2017/07/27 17:11:35 I: Switching to status: pulling-image
    2017/07/27 17:11:35 I: Calling SetOperationStatus(pulling-image)
    2017/07/27 17:11:35 I: SetOperationStatus(pulling-image) succeeded
    2017/07/27 17:11:35 I: Writing new Docker configuration file
    2017/07/27 17:11:35 I: Pulling image "rlcollins/melt@sha256:a0cc56eb841d7ae8b52ba70cdda6706fc2818bf923542f38739a046e70328e07"
    2017/07/27 17:13:05 I: Pulled image "rlcollins/melt@sha256:a0cc56eb841d7ae8b52ba70cdda6706fc2818bf923542f38739a046e70328e07" successfully.
    2017/07/27 17:13:05 I: Switching to status: localizing-files
    2017/07/27 17:13:05 I: Calling SetOperationStatus(localizing-files)
    2017/07/27 17:13:06 I: SetOperationStatus(localizing-files) succeeded
    2017/07/27 17:13:06 I: Docker file /cromwell_root/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta maps to host location /mnt/local-disk/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.
    2017/07/27 17:13:06 I: Running command: sudo gsutil -q -m cp gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta /mnt/local-disk/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta
    2017/07/27 17:13:35 I: Docker file /cromwell_root/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bai maps to host location /mnt/local-disk/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bai.
    2017/07/27 17:13:35 I: Running command: sudo gsutil -q -m cp gs://fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bai /mnt/local-disk/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bai
    2017/07/27 17:13:37 I: Docker file /cromwell_root/exec.sh maps to host location /mnt/local-disk/exec.sh.
    2017/07/27 17:13:37 I: Running command: sudo gsutil -q -m cp gs://fc-a447980a-ff46-4379-85fd-c540460400fa/c92d02c2-3d7e-4c14-b26b-b51e1b122520/MELTWorkflow/fae6050c-3dc0-4ebb-9c6f-15e4b1ff4b8a/call-MELT/exec.sh /mnt/local-disk/exec.sh
    2017/07/27 17:13:38 I: Docker file /cromwell_root/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai maps to host location /mnt/local-disk/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai.
    2017/07/27 17:13:38 I: Running command: sudo gsutil -q -m cp gs://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai /mnt/local-disk/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai
    2017/07/27 17:13:39 I: Docker file /cromwell_root/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bam maps to host location /mnt/local-disk/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bam.
    2017/07/27 17:13:39 I: Running command: sudo gsutil -q -m cp gs://fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bam /mnt/local-disk/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bam
    2017/07/27 17:30:29 I: Done copying files.
    2017/07/27 17:30:29 I: Switching to status: running-docker
    2017/07/27 17:30:29 I: Calling SetOperationStatus(running-docker)
    2017/07/27 17:30:31 I: SetOperationStatus(running-docker) succeeded
    2017/07/27 17:30:31 I: Setting these data volumes on the docker container: [-v /tmp/ggp-150205850:/tmp/ggp-150205850 -v /mnt/local-disk:/cromwell_root]
    2017/07/27 17:30:31 I: Running command: docker run -v /tmp/ggp-150205850:/tmp/ggp-150205850 -v /mnt/local-disk:/cromwell_root -e LINE1.final_comp.vcf=/cromwell_root/LINE1.final_comp.vcf -e __extra_config_gcs_path=gs://cromwell-auth-broad-ccdg-dev/fae6050c-3dc0-4ebb-9c6f-15e4b1ff4b8a_auth.json -e MELT.input_bam-0=/cromwell_root/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bam -e MELT.input_bam_index-0=/cromwell_root/fc-4cd921a5-44e8-4a11-a327-595cf63d51f9/Kathiresan_VIRGO_MESA_TAICHI_Hg19bams/VIR_00051.bai -e MELT.reference_index-0=/cromwell_root/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai -e exec=/cromwell_root/exec.sh -e ALU.final_comp.vcf=/cromwell_root/ALU.final_comp.vcf -e SVA.final_comp.vcf=/cromwell_root/SVA.final_comp.vcf -e MELT.reference-0=/cromwell_root/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta -e MELT-rc.txt=/cromwell_root/MELT-rc.txt rlcollins/melt@sha256:a0cc56eb841d7ae8b52ba70cdda6706fc2818bf923542f38739a046e70328e07 /tmp/ggp-150205850
    
  • esalinas (Broad, Member, Broadie ✭✭✭)

    hi @RLCollins

    I can't say that I have seen this issue before...

    Can you confirm that the WDL has run successfully in places other than FireCloud?

    Can you confirm whether the Docker image's ENTRYPOINT and CMD are /bin/bash? If you set them both to /bin/bash, does it make a difference?
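
    For reference, one way to check both values locally is something like this (a sketch, using the image name from your Docker Hub link):

    docker inspect \
        --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' \
        rlcollins/melt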

    -eddie

  • RLCollins (Harvard Medical School, Member)
    edited July 2017

    Hi @esalinas, thanks for the reply.

    I confirmed today that I'm able to successfully run this WDL on my local machine (a MacBook Pro with 16 GB RAM running OS X 10.12.5). I was unable to test it on the Broad and Partners Healthcare clusters due to dependency issues (as far as I can tell, the Docker CLI isn't available as a dotkit on the Broad cluster or as a module on the PHS cluster).

    I also confirmed that the Docker image's CMD is ["/bin/bash"]. As evidence, here's the script.submit file generated by Cromwell when I run this WDL locally:

    #!/bin/bash
    docker run \
      --rm -i \
       \
      --entrypoint /bin/bash \
      -v /Users/rlc/scratch/WDL_tests/cromwell-executions/MELTWorkflow/98dd6cbd-d0c2-4d21-8046-96e80f71c140/call-MELT:/cromwell-executions/MELTWorkflow/98dd6cbd-d0c2-4d21-8046-96e80f71c140/call-MELT \
      rlcollins/melt@sha256:a0cc56eb841d7ae8b52ba70cdda6706fc2818bf923542f38739a046e70328e07 /cromwell-executions/MELTWorkflow/98dd6cbd-d0c2-4d21-8046-96e80f71c140/call-MELT/execution/script
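
    For completeness, the local run above was launched roughly as follows (the Cromwell jar and inputs file names here are placeholders for my local setup):

    java -jar cromwell.jar run MELT.wdl MELT_inputs.json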
    

    As a shot in the dark, I created a new FireCloud method and configuration and have launched test runs on two samples. On the off chance that those end up completing successfully, I'll update this thread, although there's no reason to believe the error will suddenly be resolved.

    Any other ideas on what could be causing this issue?

    Thanks for your help,
    Ryan

  • esalinas (Broad, Member, Broadie ✭✭✭)
    edited July 2017

    hi @RLCollins,

    To explore this, I created a simple "hello world" WDL to see if I could reproduce the issue.

    The WDL did not reproduce the issue.

    task d {
        command <<<
            echo "hi"
            rm -rf *
        >>>

        runtime {
            docker: "rlcollins/melt@sha256:a0cc56eb841d7ae8b52ba70cdda6706fc2818bf923542f38739a046e70328e07"
            disks: "local-disk 1 HDD"
            memory: "0.1 GB"
        }
    }

    workflow w {
        call d
    }
    

    The WDL did run to completion, however, ending with a green checkmark.

    This makes me wonder whether something in the command block is causing the issue to manifest. What about your runtime block? Can you post or share those? I also wonder whether this is related to the Docker version, or to your WDL inputs and/or outputs.

    -eddie

  • RLCollins (Harvard Medical School, Member)

    Hi @esalinas,

    Thanks again for the rapid reply!

    Sure thing; here's the command block:

    command {
        runMELT.sh \
        -L ${readLength} \
        -I ${insertSize} \
        ${input_bam} \
        ${reference} \
        ${coverage} \
        /MELT/MELTv2.0.2/ \
        /
      }
    

    The last argument is an output directory for the results files generated by runMELT.sh. Could this be causing an error? I'm not totally clear on how Cromwell handles working directories, especially in combination with Docker containers, although I don't know how that would relate to the error message about Docker logs failing to be captured.

    And here's the runtime-block:

    runtime {
        docker: "rlcollins/melt"
        memory: "7 GB"
        cpu: 1
        disks: "local-disk 300 HDD"
        preemptible: 3
      }
    

    The two tests I just launched are still running, so I'll keep an eye on those.

    Thanks again for your help,
    Ryan

  • esalinas (Broad, Member, Broadie ✭✭✭)

    hi @RLCollins

    In my experience, FC doesn't do well with de-localizing directories. If you want to de-localize a directory, you should put it in a tarball (or zip, bz2, tar, or similar) and then de-localize that instead (e.g., in the output block: File myTarBallOfDir = "my_dir.tar.gz").

    So in your case, it looks like you could run "tar -cvzf melted.tgz /MELT/MELTv2.0.2/" at the end of the command block, and then have File melted = "melted.tgz" (or something like that) in the output block.
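
    A rough sketch of that pattern (the tarball name is arbitrary, and the command block is abbreviated rather than your actual WDL):

    command {
        # ... your existing runMELT.sh invocation ...
        tar -cvzf melted.tgz /MELT/MELTv2.0.2/
    }

    output {
        File melted = "melted.tgz"
    }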

    Also, if I understand correctly, your output directory is under the root at /MELT/MELTv2.0.2/. On the FireCloud Google execution VM, the root of the container's filesystem is mounted on a disk that is separate from the attached disk where files written to the current working directory go. I wonder if the boot disk, mounted at "/", is being written to and running out of disk space, and that is what's causing the issue.

    Two ideas if running out of disk space is the cause of the issue: 1) increase the boot disk size (see https://github.com/broadinstitute/cromwell#boot-disk, and the runtime sketch just below this list); 2) instead of writing output to "/MELT/MELTv2.0.2/", try writing it to "./MELT/MELTv2.0.2/" (note the leading dot) so that it's written under the current working directory.
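
    For idea 1, adding a boot disk size to your runtime block above might look something like this (the attribute name is taken from the Cromwell boot-disk docs linked above; the 50 GB value is just an example):

    runtime {
        docker: "rlcollins/melt"
        memory: "7 GB"
        cpu: 1
        disks: "local-disk 300 HDD"
        bootDiskSizeGb: 50
        preemptible: 3
      }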

    In either case, you'd have to either save the directory to a tarball or enumerate the directory contents individually or with a glob. I wouldn't recommend globs, because in my experience the data-entity model doesn't work well with them currently.
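
    For illustration, using the MELT VCF names that appear in your JES log above, the two alternatives might look roughly like this:

    output {
        # enumerate files individually
        File alu_vcf = "ALU.final_comp.vcf"
        File sva_vcf = "SVA.final_comp.vcf"
        File line1_vcf = "LINE1.final_comp.vcf"

        # or collect them with a glob (not recommended, per the note above)
        # Array[File] all_vcfs = glob("*.final_comp.vcf")
    }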

    -eddie

  • RLCollins (Harvard Medical School, Member)
    edited July 2017

    Hi @esalinas,

    Great, this is really useful information.

    I ran into the directory-delocalization issue earlier with a separate method; in that case, tarring and gzipping the directory seemed to resolve it. I'll try something similar here.

    This gives me a few new angles to try -- thanks! I'll work through them this weekend and report back on the outcome.

    Again, thanks,
    Ryan

  • RLCollins (Harvard Medical School, Member)
    edited July 2017

    @esalinas Sounds good to me; that proposed explanation makes sense.

    I've just launched a test on two samples with the output being written to ./ instead of /.

    I'll let you know if it clears up the error message. Thanks again!
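
    For reference, the adjusted command block is simply the one I posted above with the final argument changed, roughly:

    command {
        runMELT.sh \
        -L ${readLength} \
        -I ${insertSize} \
        ${input_bam} \
        ${reference} \
        ${coverage} \
        /MELT/MELTv2.0.2/ \
        ./
      }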

  • RLCollins (Harvard Medical School, Member)

    Hi @esalinas, thanks for the tip -- looks like writing to ./ instead of / fixed the problem, and the method is now running as expected.

  • esalinas (Broad, Member, Broadie ✭✭✭)

    @RLCollins Great to hear. Thanks for the update!

    -eddie
