
Localization via hard link has failed

I'm using cromwell 29 with docker, and I keep getting this error:

[2017-09-15 09:08:57,11] [warn] Localization via hard link has failed: 
/data/test/cromwell-executions/variantDiscovery/5bb6004a-dcf9-4459-84cf-7b7ecb35d960/call-Report/variantDiscoveryReport/bc64e2c9-fd27-429c-82b2-22458b63d9eb/call-plotBam/inputs/data/test/cromwell-executions/variantDiscovery/5bb6004a-dcf9-4459-84cf-7b7ecb35d960/call-dedup/shard-1/execution/102517-23.dedup.bam -> /data/test/cromwell-executions/variantDiscovery/a4726bef-8d45-4746-ac99-9cabf9dadd36/call-dedup/shard-1/execution/102517-23.dedup.bam: Operation not permitted

Because I am using docker, soft-linking is not possible, so cromwell falls back to copying every file it needs. I also generate a report for every step of the analysis, so cromwell makes a copy of every single output file. Because of this localization problem, the disk usage of every analysis is effectively doubled, and I cannot complete my analyses because I run out of hard disk space.

I found a similar issue on the forum here https://gatkforums.broadinstitute.org/wdl/discussion/9477/localization-via-hard-link-has-failed, but the solution is not very clear. I think EADG added a user to each docker image he uses? That's not an option for me, since I do not control all the docker images I use (e.g. the Broad Institute images for gatk and picard).

How can I resolve this issue? Creating a copy of every bam and fastq file every time it is needed is not acceptable in my situation, and I also don't want to stop using docker, for obvious reasons.


Answers

  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    I believe if you don't specify a user when you docker run, the default is to run as root(!)

    If I had to guess, I'd say the docker-produced output files are being created by root, and whatever user is running cromwell is not able to hard link root's files. Did you try logging into the Cromwell server to see who owns the files, and what happens when you try to make the link manually (eg ln <src> <dst>)?
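    The manual check can be sketched as follows (using a throwaway temp file rather than the real bam paths): a hard link succeeds on a file you own, whereas on most modern Linux systems the same `ln` against a root-owned file you cannot write to fails with "Operation not permitted".

```shell
# Create a throwaway file we own and hard-link it; this succeeds.
src=$(mktemp)
dst="${src}.lnk"
ln "$src" "$dst" && echo "link ok"
# The link count on the inode is now 2.
stat -c %h "$src"
# Against a root-owned file (like the docker-produced bam), the same
# `ln` would fail with EPERM when fs.protected_hardlinks is enabled.
```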

    One side note: I'm pretty sure that hard links only work if the filesystem you're using is local (eg if you've mounted some NFS drive, I don't know whether that will allow hard links).

    Since cromwell needs to be able to localize files for subsequent tasks, and you can't use copy or softlink, it seems like the only way forward is to force these files to be hard-linkable by Cromwell. A few things that you might try:

    1. Change the umask on your filesystem so that linking is allowed on the files that the docker images create?
    2. Update the task to broaden permissions on the files after it produces them (eg add chmod +wr ... to the end of every task command)?
    3. Find out whether another user already exists on the docker image you're using, and run the commands as that user?
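    For option 2, the effect of broadening permissions at the end of a task command can be sketched like this (against a scratch file, so the mode change is visible; the real fix would go after the task's actual command):

```shell
# Simulate a restrictively-permissioned task output, then apply the
# broadened permissions that the appended chmod would grant.
out=$(mktemp)
chmod 600 "$out"   # restrictive, similar to a root-created file's mode
chmod a+rw "$out"  # the appended `chmod +wr`-style fix
stat -c %a "$out"  # now 666: readable and writable by everyone
```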
  • Redmar_van_den_Berg (Member)
    edited September 2017

    Thanks for your response. I think I will update each task to include a chmod +wr for all the output files, since that is the most general solution.

    Are there any plans to address this issue in cromwell itself? From looking at the code on GitHub, changing cromwell/backend/src/main/scala/cromwell/backend/standard/StandardAsyncExecutionActor.scala to include chmod 666 * after the INSTANTIATED_COMMAND would solve this as well.

    From

      /** A bash script containing the custom preamble, the instantiated command, and output globbing behavior. */
      def commandScriptContents: String = {
        jobLogger.info(s"`$instantiatedCommand`")
    
        val cwd = commandDirectory
        val rcPath = cwd./(jobPaths.returnCodeFilename)
        val rcTmpPath = rcPath.plusExt("tmp")
    
        val globFiles = backendEngineFunctions.findGlobOutputs(call, jobDescriptor)
    
        s"""|#!/bin/bash
            |tmpDir=$$(mktemp -d $cwd/tmp.XXXXXX)
            |chmod 777 $$tmpDir
            |export _JAVA_OPTIONS=-Djava.io.tmpdir=$$tmpDir
            |export TMPDIR=$$tmpDir
            |$commandScriptPreamble
            |(
            |cd $cwd
            |INSTANTIATED_COMMAND
            |)
            |echo $$? > $rcTmpPath
            |(
            |cd $cwd
            |${globManipulations(globFiles)}
            |)
            |SCRIPT_EPILOGUE
            |mv $rcTmpPath $rcPath
            |""".stripMargin.replace("INSTANTIATED_COMMAND", instantiatedCommand).replace("SCRIPT_EPILOGUE", scriptEpilogue)
      }
    

    to

      /** A bash script containing the custom preamble, the instantiated command, and output globbing behavior. */
      def commandScriptContents: String = {
        jobLogger.info(s"`$instantiatedCommand`")
    
        val cwd = commandDirectory
        val rcPath = cwd./(jobPaths.returnCodeFilename)
        val rcTmpPath = rcPath.plusExt("tmp")
    
        val globFiles = backendEngineFunctions.findGlobOutputs(call, jobDescriptor)
    
        s"""|#!/bin/bash
            |tmpDir=$$(mktemp -d $cwd/tmp.XXXXXX)
            |chmod 777 $$tmpDir
            |export _JAVA_OPTIONS=-Djava.io.tmpdir=$$tmpDir
            |export TMPDIR=$$tmpDir
            |$commandScriptPreamble
            |(
            |cd $cwd
            |INSTANTIATED_COMMAND
            |chmod 666 *
            |)
            |echo $$? > $rcTmpPath
            |(
            |cd $cwd
            |${globManipulations(globFiles)}
            |)
            |SCRIPT_EPILOGUE
            |mv $rcTmpPath $rcPath
            |""".stripMargin.replace("INSTANTIATED_COMMAND", instantiatedCommand).replace("SCRIPT_EPILOGUE", scriptEpilogue)
      }
    

    This would resolve the issue for all users and docker images. The drawback is that a malicious user could change your cromwell output files without you noticing.

  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Great, let us know whether or not it works for you, for the sake of the next person who comes along! :smile:

    I think in a single-user environment making all outputs world-readable might be appropriate. But in a world where the FS might be a cloud object store or an NFS shared filesystem, and workflow outputs might be protected data... I'd be very nervous about making any assumptions about opening files up to the world, especially without telling anyone and giving them no option to opt out!

    Having said that, I'll ping @Geraldine_VdAuwera regarding the users on the images, since maybe it's possible to iterate towards a better user experience in our "you should be using this" gatk/picard dockers.

    Thanks!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi all, we can definitely consider tweaking our dockers to improve the experience, eg by creating a non-root user on the docker image. Do I understand correctly that this is what would be helpful here? If we create a generic user with some arbitrary name, will you be able to make use of it?

  • kshakir (Broadie, Dev)

    Hi @Redmar_van_den_Berg,

    Re:

    changing cromwell/backend/src/main/scala/cromwell/backend/standard/StandardAsyncExecutionActor.scala to include chmod 666 * after the INSTANTIATED_COMMAND would solve this

    In the source snippet you mentioned, the SCRIPT_EPILOGUE that runs after INSTANTIATED_COMMAND is configurable. See this section in the example conf.

    Something like script-epilogue = "chmod -R a+r * && sync" may work for your setup without requiring any changes to the existing cromwell code base.
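    For illustration, this is roughly what that suggested epilogue would execute inside a task's execution directory (sketched here against a scratch directory, not a real Cromwell run):

```shell
# Stand-in for a task execution directory with some outputs.
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt"
# The suggested epilogue: make everything readable, then flush writes
# to disk (the sync matters on shared filesystems).
chmod -R a+r "$dir" && sync
echo "epilogue ok"
```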

  • Hi @Geraldine_VdAuwera,
    I've tested adding a user to the docker image, and it seems to work, but it is not a reliable solution. Linux uses the numeric UID to assign ownership to files. So if you are the first user on the system that runs cromwell (UID=1000) and you add a user inside the docker image (which also gets UID=1000), the UIDs of the files will match and hard linking etc. will work. However, if another user on the system runs cromwell, the UIDs will not match, and you get the same problems as when the files are owned by root.

    So I don't think it makes sense to add another user inside the docker to solve this specific problem.

  • @ChrisL said:
    I'd be very nervous about making any assumption regarding opening files up to the world. [...]

    I agree that this is not a good solution. However, it should be noted that the output files are already world-readable; otherwise the cromwell process could not copy them to the input folder of the next task (since they are owned by root). The reason cromwell cannot hard link them is a security precaution on most systems that forbids hard linking files the user does not have read and write access to, since hard linking system files can be a security risk.

    So I guess the more general version of the problem is that there is no way to set the file permissions of outputs produced inside a docker image. In my case, this means that cromwell cannot hard link them because I do not have write permission on files owned by root.
    For users on a shared filesystem, there is no way to protect sensitive data from other users, since the files must be world-readable for cromwell to copy them over to the next task.
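    The precaution referred to above is, on modern Linux kernels, the fs.protected_hardlinks sysctl (enabled by default on most distributions): when it is 1, a user may hard link a file only if they own it or have both read and write access to it. You can check a system like this:

```shell
# Read the sysctl directly from /proc; it is 1 (links restricted) on
# most modern distributions, 0 if unrestricted.
if [ -r /proc/sys/fs/protected_hardlinks ]; then
    protected=$(cat /proc/sys/fs/protected_hardlinks)
else
    protected="unknown"   # kernel without the knob, or /proc not mounted
fi
echo "fs.protected_hardlinks: $protected"
```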

  • I think I have found a solution that will solve most of the permissions problems without modifying docker images or adding commands to make outputs world read/writable.

    It turns out you can specify a numerical UID as docker_user, and the output files will be owned by UID:root, even when that user is not present in the docker image. Back 'outside' the docker image, the numerical UID maps to the user that runs cromwell, so that user owns all the files created by the docker container.

    On my (single user) system, the files are still world readable, which is not a problem for me. I would guess that to solve this the owner of the docker image would need to change the default umask inside the image.

    task getUID {
        command {
            echo $UID
        }
        output {
            Int UID = read_int(stdout())
        }
    }

    task test {
        String image
        Int UID
        command {
            script.sh
        }
        runtime {
            docker: "${image}"
            docker_user: "${UID}"
        }
        output {
            File out = "output.txt"
        }
    }

    workflow wf {
        call getUID
        call test {
            input: image = "nvwalab/usertest:0.1",
                   UID = getUID.UID
        }
        output {
            File out = test.out
        }
    }

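    The same mechanism can be seen outside WDL with plain docker: ownership is recorded as a numeric UID, so passing the host user's UID via -u makes container-created files map back to that user (the image name below is hypothetical, and the docker command is shown but not executed here):

```shell
# The numeric UID of the user running cromwell on the host.
uid=$(id -u)
echo "host UID: $uid"
# Equivalent docker invocation (hypothetical image; not run here):
#   docker run --rm -u "$uid" -v "$PWD:/work" -w /work my/image \
#       sh -c 'touch output.txt'
# output.txt in $PWD would then be owned by $uid, not root.
```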
  • @kshakir
    Thanks for that link; that is definitely the correct place to change the permissions, instead of pasting chmod commands inside every command block, which is what I was doing. I have found a solution that does not require any chmod commands (see my previous post), but I will keep that setting in mind for future reference.

  • @kshakir

    Is there a variable to access the directory where docker runs the analysis that can be used with the epilogue?

    The script that cromwell generates to run the analysis contains a number of lines like
    cd /cromwell-executions/variantDiscovery/6659a8cc-20ae-4cd2-a03c-b9035154d2f0/call-trimmomatic/shard-2/execution

    to change to the folder where the actual analysis is performed. Is there a way to use a variable for this path in the epilogue script? I want to automate generating an md5 sibling file for all the outputs to speed up call caching, but for that I need access to the folder the outputs are written to.
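    As a sketch of the md5-sibling idea: assuming the epilogue does run in the execution directory, a loop like the following would do it (shown here against a scratch directory, not a real execution folder):

```shell
# Stand-in for an execution directory with one output file.
dir=$(mktemp -d)
echo "data" > "$dir/output.txt"
cd "$dir"
# Write an .md5 sibling next to every regular file.
for f in *; do
    [ -f "$f" ] && md5sum "$f" > "$f.md5"
done
ls *.md5
```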

  • kshakir (Broadie, Dev)

    Can you create a new forum post describing more of what you're trying to do / what you expect for caching / what you're currently seeing? There may be an opportunity for a feature that could benefit the wider community.

    In the meantime, the script epilogue is not currently guaranteed to run in the execution directory, but I submitted a PR so that it will in the next version. Then the epilogue can just record $PWD if it wants.
