
Failure to delocalize files when using my own docker image

Hi-

I am having some trouble with file delocalization in FireCloud when using my own docker images.

For the setup, I made a Dockerfile that contains all of my source code (an autobuild from a GitHub repo) that I want to use for analyses in FireCloud. In my method in FireCloud, this docker image is referenced in the runtime block, giving each task access to my source code (basically a bunch of R scripts). I hard-coded the path to the corresponding script inside the docker container within each WDL task.
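
For illustration, an image like this might be built along these lines (a hypothetical sketch, not the actual tmajarian/topmed Dockerfile; the base image and paths are assumptions):

    # hypothetical sketch of the autobuilt image -- not the real Dockerfile
    FROM r-base:3.4.1

    # copy the repo's R scripts into a fixed location that the WDL tasks reference
    COPY workflows/ /src/workflows/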

When I try to run my method in Firecloud, I am seeing some weird behavior in where files are being moved/written:

message: Task fullPipe.getarray:NA:1 failed. JES error code 5. Message: 10: Failed to delocalize files: failed to copy the following files: "/mnt/local-disk/getarray-rc.txt -> gs://fc-fa093e72-dbcb-4028-ae82-609a79ced51a/3d32ccf4-28ba-43d8-8704-7c87d8f34be7/fullPipe/ae7b05d4-cc26-451b-8a07-00b5b12d26a8/call-getarray/getarray-rc.txt (cp failed: gsutil -q -m cp -L /var/log/google-genomics/out.log /mnt/local-disk/getarray-rc.txt gs://fc-fa093e72-dbcb-4028-ae82-609a79ced51a/3d32ccf4-28ba-43d8-8704-7c87d8f34be7/fullPipe/ae7b05d4-cc26-451b-8a07-00b5b12d26a8/call-getarray/getarray-rc.txt, command failed: CommandException: No URLs matched: /mnt/local-disk/getarray-rc.txt\nCommandException: 1 file/object could not be transferred.\n)"

From the log file, the task seems to complete but fails when copying files:

2017/08/29 18:09:53 I: Running command: iptables -I FORWARD -d metadata.google.internal -p tcp --dport 80 -j DROP
2017/08/29 18:09:53 I: Setting these data volumes on the docker container: [-v /tmp/ggp-146399440:/tmp/ggp-146399440 -v /mnt/local-disk:/cromwell_root]
2017/08/29 18:09:53 I: Running command: docker run -v /tmp/ggp-146399440:/tmp/ggp-146399440 -v /mnt/local-disk:/cromwell_root -e fc-d960a560-7e5c-4083-b61e-b2ea71ae5b14/passgt.minDP10-gds500/chunk2.freeze4.chrALL.pass.gtonly.minDP10.genotypes.gds=/cromwell_root/fc-d960a560-7e5c-4083-b61e-b2ea71ae5b14/passgt.minDP10-gds500/chunk2.freeze4.chrALL.pass.gtonly.minDP10.genotypes.gds -e __extra_config_gcs_path=gs://cromwell-auth-amp-t2d-op/ae7b05d4-cc26-451b-8a07-00b5b12d26a8_auth.json -e getarray.gdsfilesin-0=/cromwell_root/fc-d960a560-7e5c-4083-b61e-b2ea71ae5b14/passgt.minDP10-gds500/chunk1.freeze4.chrALL.pass.gtonly.minDP10.genotypes.gds -e getarray.gdsfilesin-1=/cromwell_root/fc-d960a560-7e5c-4083-b61e-b2ea71ae5b14/passgt.minDP10-gds500/chunk2.freeze4.chrALL.pass.gtonly.minDP10.genotypes.gds -e exec=/cromwell_root/exec.sh -e getarray-rc.txt=/cromwell_root/getarray-rc.txt -e fc-d960a560-7e5c-4083-b61e-b2ea71ae5b14/passgt.minDP10-gds500/chunk1.freeze4.chrALL.pass.gtonly.minDP10.genotypes.gds=/cromwell_root/fc-d960a560-7e5c-4083-b61e-b2ea71ae5b14/passgt.minDP10-gds500/chunk1.freeze4.chrALL.pass.gtonly.minDP10.genotypes.gds tmajarian/[email protected]:b0b54996d86746d199493a94dbc92751c4a1d9399c7898e58174c84d35fe44fe /tmp/ggp-146399440
2017/08/29 18:09:54 I: Switching to status: delocalizing-files
2017/08/29 18:09:54 I: Calling SetOperationStatus(delocalizing-files)
2017/08/29 18:09:54 I: SetOperationStatus(delocalizing-files) succeeded
2017/08/29 18:09:54 I: Docker file /cromwell_root/getarray-rc.txt maps to host location /mnt/local-disk/getarray-rc.txt.
2017/08/29 18:09:54 I: Running command: sudo gsutil -q -m cp -L /var/log/google-genomics/out.log /mnt/local-disk/getarray-rc.txt gs://fc-fa093e72-dbcb-4028-ae82-609a79ced51a/3d32ccf4-28ba-43d8-8704-7c87d8f34be7/fullPipe/ae7b05d4-cc26-451b-8a07-00b5b12d26a8/call-getarray/getarray-rc.txt
2017/08/29 18:09:55 E: command failed: CommandException: No URLs matched: /mnt/local-disk/getarray-rc.txt
CommandException: 1 file/object could not be transferred.
 (exit status 1)

This problem seems to occur only with the docker images that I create; the task above completes when a different docker image is used (one that was built by someone else). The docker image is also public: tmajarian/topmed. Here is the WDL that I am using:

task getarray {
    Array[File] gdsfilesin

    command {
        ls -lh ${sep = ' ' gdsfilesin}
    }

    output {
        Array[File] gdsfilesout = gdsfilesin
    }

    runtime {
        docker: "tmajarian/[email protected]:1b10a60f8ad47316b71e51ea864fa1b68fb0585cc5ac190f827573e6eaa0348e"
    }

}

task common_ID {
        File gds
        File ped
        String idcol
        String label

        command {
                R --vanilla --args ${gds} ${ped} ${idcol} ${label} < /src/workflows/singleVariantFull/commonID.R
        }

        meta {
                author: "jasen jackson"
                email: "[email protected]"
        }

        runtime {
           docker: "tmajarian/[email protected]:1b10a60f8ad47316b71e51ea864fa1b68fb0585cc5ac190f827573e6eaa0348e"
           disks: "local-disk 100 SSD"
           memory: "3G"
        }

        output {
                File commonIDstxt = "${label}.commonIDs.txt"
                File commonIDsRData = "${label}.commonIDs.RData"
        }
}

task assocTest {
    File gds
    File ped
    File GRM
    File commonIDs
    String label
    String colname
    String outcome
    String outcomeType
    String covariates

    command {
        R --vanilla --args ${gds} ${ped} ${GRM} ${commonIDs} ${colname} ${label} ${outcome} ${outcomeType} ${covariates} < /src/workflows/singleVariantFull/assocSingleVar.R
    }

    meta {
        author: "jasen jackson; Alisa Manning, Tim Majarian"
        email: "[email protected]; [email protected], [email protected]" 
    }

    runtime {
        # docker: "tmajarian/[email protected]:1b10a60f8ad47316b71e51ea864fa1b68fb0585cc5ac190f827573e6eaa0348e"
        docker: "tmajarian/topmed:latest"
        disks: "local-disk 100 SSD"
        memory: "30G"
    }

    output {
        File assoc = "${label}.assoc.RData"
    }
}

task summary {
    Array[File] assoc
    String pval
    String label
    String title

    command {
        R --vanilla --args ${pval} ${label} ${title} ${sep = ' ' assoc} < /src/workflows/singleVariantFull/summarySingleVar.R
    }

    runtime {
        docker: "tmajarian/[email protected]:1b10a60f8ad47316b71e51ea864fa1b68fb0585cc5ac190f827573e6eaa0348e"
        disks: "local-disk 100 SSD"
        memory: "30G"
    }

    output {
        File mhplot = "${label}.mhplot.png"
        File qqplot = "${label}.qqplot.png"
        File topassoccsv = "${label}.topassoc.csv"
        File allassoccsv = "${label}.assoc.csv"
    }
}

workflow fullPipe {
    Array[File] genFiles
    File this_ped
    File this_kinshipGDS
    String this_label
    String this_colname
    String this_outcome
    String this_outcomeType
    String this_covariates
    String this_pval 
    String this_title

    call getarray { input: gdsfilesin=genFiles }

    call common_ID {
        input: gds=getarray.gdsfilesout[0], ped=this_ped, idcol=this_colname, label=this_label
    }

    scatter ( this_genfile in getarray.gdsfilesout ) {
        call assocTest {
            input: gds = this_genfile, ped = this_ped, GRM = this_kinshipGDS, commonIDs = common_ID.commonIDsRData, colname = this_colname, outcome = this_outcome, outcomeType = this_outcomeType, covariates = this_covariates,  label=this_label
        }

    }

    call summary {
        input: assoc = assocTest.assoc, pval=this_pval, label=this_label, title=this_title
    }

    output { 
        File mhplot=summary.mhplot
        File qqplot=summary.qqplot
        File allassoc=summary.allassoccsv
        File topassoc=summary.topassoccsv 
    }

}

Any input would be totally awesome.

-Tim


Answers

  • esalinas

    It looks like the issue is in getarray. Is that right?

    Can you confirm that when you run the task you get stderr and stdout files, and that their contents suggest your pipeline ran? The task seems to consist of just listing the files. If both of those files are in the task directory and one or both of them suggest successful execution, then my guess is that you have some directory, path, docker, or disk related issue.

    Are you saying that the same command block but with another docker image runs fine?

    Make sure that you have enough disk space to write out the output files. Try adding a "df -h" or similar command to confirm that the attached disk has sufficient space (a rough sketch is at the end of this post).

    Do you have a "workdir" set in your dockerfile? Could that be causing an issue?
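
    For illustration, a minimal sketch of the getarray task with the debugging command added (the extra df -h lines are an assumption about what to check, not part of the original command):

        task getarray {
            Array[File] gdsfilesin

            command {
                # debugging only: report free space on every mounted disk before and after the ls
                df -h
                ls -lh ${sep = ' ' gdsfilesin}
                df -h
            }

            output {
                Array[File] gdsfilesout = gdsfilesin
            }

            runtime {
                docker: "tmajarian/[email protected]:1b10a60f8ad47316b71e51ea864fa1b68fb0585cc5ac190f827573e6eaa0348e"
            }
        }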

  • esalinas

    Hi @tmajarian

    Also look at the exec.sh in the bucket, in the directory for the task. That script shows how the rc.txt file gets created and may shed some light on what's going on. The exec.sh is what actually gets run; see "-e exec" in the JES log.

    -Eddie

  • tmajarian

    No stderr or stdout files were created (at least they were not copied to the output bucket). And yes, the command block has worked with another docker image and produced the desired results.

    I'm thinking that it's a path/disk space issue? No workdir is specified in the docker file. Is it possible that the input files are being copied to the boot disk and filling it up?

  • esalinas

    hi @tmajarian ,

    would it be possible for you to share with me the workspace? If so, can you make me READER with my email "[email protected]"?

    I'm now guessing/wondering if it's an input-related issue? Do you use the same inputs as the other run that you said was successful? Do you have permissions to read the inputs?

    You confirmed that neither the stderr nor the stdout files were in the bucket. Can you confirm, however, that the JES log was there? Can you confirm that each of the files in the array was successfully localized to the VM? Did you get an operations ID? I'm also wondering whether the task ran at all.

    -eddie

  • gordon123

    How big are your input files? If they total more than 6-7 GB or so, you need to specify the boot disk size separately to prevent the inputs from filling the disk; the boot disk defaults to 10 GB if left unspecified (a sketch of the runtime block is below). More detail in this previous forum post:

    https://gatkforums.broadinstitute.org/firecloud/discussion/8803/moving-docker-disk-image-off-boot-disk
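
    For illustration, a rough sketch of a runtime block with an explicit boot disk size, assuming Cromwell's bootDiskSizeGb attribute (documented at the Cromwell link further down this thread); the sizes are placeholders:

        runtime {
            docker: "tmajarian/topmed:latest"
            disks: "local-disk 100 SSD"   # attached working disk, mounted at the task's working directory
            bootDiskSizeGb: 20            # boot disk holds the docker image; defaults to 10 GB
            memory: "3G"
        }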

  • Tiffany_at_Broad

    Hey @esalinas and @tmajarian

    I am having a similar error to yours, "JES error code 5. Message: 10: Failed to delocalize files: failed to copy the following files:....command failed: CommandException: No URLs matched: /mnt/local-disk/..."

    Please let us know if you find the root cause.

  • gordon123

    A correction to my earlier post: the input files are not stored on the boot disk. However, the docker image is, as well as any files you write to places outside of the cwd output directory, e.g. /tmp. If you write a lot to /tmp, you need to expand the boot disk, or better yet move where you write your temp files.
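
    For example, a command block could point temporary files at the working directory (which lives on the attached disk) instead of /tmp. This is only an illustration and assumes the R script honors TMPDIR:

        command {
            # write temp files to the attached disk (the cwd) rather than the ~10 GB boot disk
            mkdir -p tmp
            export TMPDIR=$(pwd)/tmp
            R --vanilla --args ${gds} ${ped} ${idcol} ${label} < /src/workflows/singleVariantFull/commonID.R
        }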

  • esalinas

    @gordon123 @tmajarian If I understand/recall correctly, your task is doing very little and simply does an "ls" on the files, right?

    My thinking is that the output is very much linked to the input in the WDL.
    In the input you have

        Array[File] gdsfilesin
    

    and in the output you have :

        output {
            Array[File] gdsfilesout = gdsfilesin
        }
    

    which almost makes the task seem redundant. If the input is the same as the output, why is the task there? Can you skip the task and feed the input directly to the next task (a sketch is below)? I wonder if this "tight coupling" is causing the issue?

    I'm interested to see if there's an operations ID and to see the whole JES log which might have clues to the cause of the issue.
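
    To illustrate, a hypothetical version of the workflow calls that skips getarray and feeds genFiles straight through (the follow-up below explains why the task was kept):

        call common_ID {
            input: gds=genFiles[0], ped=this_ped, idcol=this_colname, label=this_label
        }

        scatter ( this_genfile in genFiles ) {
            call assocTest {
                input: gds = this_genfile, ped = this_ped, GRM = this_kinshipGDS, commonIDs = common_ID.commonIDsRData, colname = this_colname, outcome = this_outcome, outcomeType = this_outcomeType, covariates = this_covariates, label = this_label
            }
        }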

  • tmajarian
    edited September 2017

    @esalinas I'm not sure that I can actually share the workspace. However, I can share another one where I am having the exact same issue, albeit with a different task in the WDL. I'll share that workspace with you; check out any failure under the config 'topmed-t2d/epacts-3.2.6-sing-var-mkl_*'.

    And as for the getarray task: it is necessary to import Array[File]s within the FireCloud data model. (See https://gatkforums.broadinstitute.org/firecloud/discussion/comment/41151#Comment_41151)

  • esalinas
    edited September 2017

    Hi @tmajarian,

    Can you confirm that this bucket directory represents an instance of the issue: db864a5a-2a86-4f0a-a60e-9efb78bd29e9/w/7893d854-8c5c-44f6-b46d-8bff1343f18b/call-singlevar ?

    I note that you're calling tabix and EPACTS in this case under "/tmp2". I also see you have

    TMPDIR=`pwd` 
    

    (and not TMPDIR=/tmp). Despite that, I still have a suspicion that a disk is filling up.

    Per the runtime attributes documented here
    https://github.com/broadinstitute/cromwell#runtime-attributes
    can you add plenty of both attached disk (here set via "disksize"):

        memory: "${memory} GB"
        disks: "local-disk ${disksize} SSD"
    
    

    but also add plenty of boot-disk as seen here :
    https://github.com/broadinstitute/cromwell#boot-disk

    I suggest these to try to rule out the disk filling up as a cause. I know the stderr never says "Error: no space left on device" or anything like that, but I want to see whether that can be ruled out. On that note, I see that both the stdout and stderr files are present, but they are both empty.

    Consider "dstat" which can be used for disk-space-usage monitoring : http://dag.wiee.rs/home-made/dstat/

    Dstat can be installed (under Ubuntu 16.04) like so:

    apt-get update && apt-get install -y python sudo dstat
    

    It can be run in the background; at the end of the task, retrieve its PID (use "ps" and "grep") and "kill -9" it to terminate monitoring:

    dstat --freespace  --nocolor -cdngylmt   --output dstat.log 1>dstat.out 2>dstat.err &
    

    Save the log file (note the ".log" extension!) and examine it to inspect disk usage over time. I believe "*.log" files are downloaded to the bucket every 5 minutes, so use ".log" as the output extension!

    -eddie

  • gordon123
    edited September 2017

    *.log files are not automatically downloaded; the dstat.log file will need to be specified explicitly as an output file (see the sketch after this post).

    Also, note scripts here to start and stop the dstat processes, intended to be called from the wdl command block:
    https://github.com/broadinstitute/firecloud_developer_toolkit/blob/master/algutil/monitor_start.py
    https://github.com/broadinstitute/firecloud_developer_toolkit/blob/master/algutil/monitor_stop.py
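
    Putting the two suggestions together, a rough sketch of one task with dstat monitoring and the log declared as an explicit output; it assumes dstat is already installed in the image and uses the shell's $! to capture the PID instead of ps/grep:

        task common_ID {
            File gds
            File ped
            String idcol
            String label

            command {
                # start background monitoring of disk space, CPU, memory, etc.
                dstat --freespace --nocolor -cdngylmt --output dstat.log 1>dstat.out 2>dstat.err &
                DSTAT_PID=$!
                R --vanilla --args ${gds} ${ped} ${idcol} ${label} < /src/workflows/singleVariantFull/commonID.R
                # stop monitoring so the task can exit
                kill $DSTAT_PID
            }

            output {
                File dstatlog = "dstat.log"
                File commonIDstxt = "${label}.commonIDs.txt"
                File commonIDsRData = "${label}.commonIDs.RData"
            }

            runtime {
                docker: "tmajarian/topmed:latest"
                disks: "local-disk 150 SSD"
                bootDiskSizeGb: 20
                memory: "3G"
            }
        }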

  • tmajarian
    edited September 2017

    @esalinas I think I may have it pinned down. I ran a similar method config (which succeeded) without specifying

    disks: "local-disk ${disksize} SSD"
    

    I had inherited this code and (as someone with two weeks of experience with wdl/firecloud) assumed that "local-disk" was a keyword rather than an actual path to be accessed during runtime. In all of my new methods (which also all use my docker images), I added this disk specification. So it seems (to me) that the actual problem might be here?

    I'll get back to you using dstat to see about disk usage.

  • gordon123

    BTW, in my informal testing HDD performed about the same as SSD, since for many of our workflows the bulk of the I/O is reading linearly through a BAM file. Also, HDD is 1/4 the price of SSD.
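
    That is, the same disks attribute with HDD in place of SSD (the size here is a placeholder):

        disks: "local-disk 150 HDD"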

  • esalinas
    edited September 2017

    @tmajarian Did you increase the disk size (attached disk, or boot?) and see if it ran okay? Or did you remove the disks specification and it ran okay?

    -eddie

  • Ruchi

    Hey @tmajarian,

    1. When disks: "local-disk ${disksize} SSD" isn't specified, the disk size defaults to local-disk 10 SSD.

    2. You mentioned earlier that you run task getArray because you're trying to import gdsfilesin/genFiles into the FireCloud data model. However, only workflow-level outputs are written to the data model, and it doesn't seem like getArray.gdsfilesout is being declared as a workflow output (a sketch is at the end of this post). Are you trying to create an attribute "genFiles" or "gdsFiles" for the entity you're running on?

    3. I ran a test workflow using the original docker image you referenced:

    workflow w {
       Array[File] inArray
       call t {input: inArray = inArray }
    
    }
    
    task t {
      Array[File] inArray
      command {
        ls -lh ${sep=' ' inArray }
      }
      runtime {
        docker: "tmajarian/[email protected]:1b10a60f8ad47316b71e51ea864fa1b68fb0585cc5ac190f827573e6eaa0348e"
      }
    }
    

    This workflow ran successfully: I have a stdout() file and the return code is 0. I agree with @esalinas and @gordon123 that it's a disk space issue, since your docker image is compatible with the command you tested when run with much smaller/fewer input files. Let me know if you have any recent submissions I can help investigate.
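
    For reference on point 2, a hypothetical workflow output block that would also write the array back to the data model; the attribute name gdsFiles is just an illustration:

        output {
            Array[File] gdsFiles = getarray.gdsfilesout
            File mhplot = summary.mhplot
            File qqplot = summary.qqplot
            File allassoc = summary.allassoccsv
            File topassoc = summary.topassoccsv
        }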

  • tmajarian

    @esalinas :
    Changed the boot disk size to 20 GB (larger than both the docker image and input files) and the attached disk size to 150 GB (much larger than any result file to be generated). Still had the same failure.

    Also, outputting the dstat log is presenting a problem: the workflow fails while delocalizing the log, so it never makes it to the bucket.

    Hi @Ruchi, I think the difference between what you ran there and my task is the lack of an output section? The reason that I put an output in both the workflow and the task is that intermediate files may be generated that do not need to be pushed back to the data model. (Basically the pipeline produces both raw results and some figures/tables; the paths to the figure/table outputs are then populated into the data model.)

  • gordon123

    1) Don't be stingy with disk capacity when debugging; make both the boot disk and attached disk larger than you need, e.g. set both to 150 GB or 500 GB. Once things work you can tune the sizes down to save money, based in part on the dstat logs.

    2) The args for dstat can tell it to emit to stdout, which I assume is getting delocalized.
