Unable to access jarfile when running in Docker on a local computer

System (System Administrator, admin)
This discussion was created from comments split from: (How to) Run the GATK4 Docker locally and take a look inside.

Comments

  • Angry_Panda (Member)
    edited August 2018

    Dear GATK team:

    Thanks very much; this is an amazing Docker tutorial.

    I was using Docker on my local computer to run the Best Practices data pre-processing workflow for hg38, following this link: https://github.com/gatk-workflows/gatk4-data-processing. I deleted the runtime part of the WDL. I got this error report:

    [2018-08-14 10:28:58,02] [error] WorkflowManagerActor Workflow c83e2d4f-03e8-48a7-9571-629c1b15651a failed (during ExecutingWorkflowState): Job PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem:16:1 exited with return code 127 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /Users/shuanglu/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/c83e2d4f-03e8-48a7-9571-629c1b15651a/call-SamToFastqAndBwaMem/shard-16/execution/stderr.
     /Users/shuanglu/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/c83e2d4f-03e8-48a7-9571-629c1b15651a/call-SamToFastqAndBwaMem/shard-16/execution/script: line 32: samtools: command not found
    /Users/shuanglu/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/c83e2d4f-03e8-48a7-9571-629c1b15651a/call-SamToFastqAndBwaMem/shard-16/execution/script: line 30: /usr/gitc/bwa: No such file or directory
    Error: Unable to access jarfile /usr/gitc/picard.jar
    

    I ran:

    docker run -i -t broadinstitute/genomes-in-the-cloud:2.3.1-1512499786

    and then ls, and found there is a picard file at that path.
    I am totally confused and don't know how to fix the error.

  • Sheila (Broad Institute; Member, Broadie, admin)

    @Angry_Panda
    Hi,

    I will ask someone on the team to get back to you.

    -Sheila

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator, admin)

    @Angry_Panda can you attach the stderr log from shard-16/execution/stderr?

  • Angry_Panda (Member)
    edited August 2018

    @Tiffany_at_Broad said:
    @Angry_Panda can you attach the stderr log from shard-16/execution/stderr?

    Thanks very much @Sheila and @Tiffany_at_Broad for your replies.
    This is my stderr log from shard-16/execution/stderr:

    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/ac1b257a-b883-47c2-a124-c1dec5ca2c7f/call-SamToFastqAndBwaMem/shard-16/tmp.53e34863
    22:44:20.602 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/gitc/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Sun Aug 19 22:44:21 UTC 2018] SamToFastq INPUT=/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/ac1b257a-b883-47c2-a124-c1dec5ca2c7f/call-SamToFastqAndBwaMem/shard-16/inputs/-1729239788/wgs_ubam-NA12878_24RG-small-HK35M.3.NA12878.interval.filtered.query.sorted.unmapped.bam FASTQ=/dev/stdout INTERLEAVE=true INCLUDE_NON_PF_READS=true    OUTPUT_PER_RG=false COMPRESS_OUTPUTS_PER_RG=false RG_TAG=PU RE_REVERSE=true CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Sun Aug 19 22:44:21 UTC 2018] Executing as [email protected] on Linux 4.9.93-linuxkit-aufs amd64; OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2~bpo8+1-b14; Deflater: Intel; Inflater: Intel; Picard version: 2.16.0-SNAPSHOT
    [Sun Aug 19 22:44:24 UTC 2018] picard.sam.SamToFastq done. Elapsed time: 0.07 minutes.
    Runtime.totalMemory()=3014656000
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.samtools.SAMException: Error in writing fastq file /dev/stdout
            at htsjdk.samtools.fastq.BasicFastqWriter.write(BasicFastqWriter.java:66)
            at picard.sam.SamToFastq.writeRecord(SamToFastq.java:356)
            at picard.sam.SamToFastq.doWork(SamToFastq.java:206)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
    /cromwell-executions/PreProcessingForVariantDiscovery_GATK4/ac1b257a-b883-47c2-a124-c1dec5ca2c7f/call-SamToFastqAndBwaMem/shard-16/execution/script: line 33:    11 Exit 1                  java -Dsamjdk.compression_level=5 -Xms3000m -jar /usr/gitc/picard.jar SamToFastq INPUT=/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/ac1b257a-b883-47c2-a124-c1dec5ca2c7f/call-SamToFastqAndBwaMem/shard-16/inputs/-1729239788/wgs_ubam-NA12878_24RG-small-HK35M.3.NA12878.interval.filtered.query.sorted.unmapped.bam FASTQ=/dev/stdout INTERLEAVE=true NON_PF=true
            12 Killed                  | /usr/gitc/bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta /dev/stdin - 2> >(tee wgs_ubam-NA12878_24RG-small-HK35M.3.NA12878.interval.filtered.query.sorted.unmapped.unmerged.bwa.stderr.log >&2)
            13 Done                    | samtools view -1 - > wgs_ubam-NA12878_24RG-small-HK35M.3.NA12878.interval.filtered.query.sorted.unmapped.unmerged.bam
    stderr (END)
    
  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator, admin)

    Thanks @Angry_Panda. The log states that there is an error writing the fastq file /dev/stdout.
    Curious: when you say you are using Docker, are you using a mounted volume to access data that lives outside of the container (Step 5 in this article)? I am consulting a colleague for further recommendations.

  • Angry_Panda (Member)
    edited August 2018

    @Tiffany_at_Broad, thanks very much for your fast reply. I am really a beginner. When I say I use Docker, I mean that I pulled the Docker images and checked that Docker is running on my laptop; then I ran:
    java -jar /Users/angrypanda/Documents/Biosoft/cromwell/cromwell-34.jar run /Users/angrypanda/Documents/Bio/gatk4_best_practice/data_pre-processing/processing-for-variant-discovery-gatk4.wdl --inputs /Users/angrypanda/Documents/Bio/gatk4_best_practice/data_pre-processing/processing-for-variant-discovery-gatk4.hg38.wgs.inputs.json

  • bshifaw (Member, Broadie, Moderator, admin)
    edited August 2018

    Hi @Angry_Panda

    What's inside your JSON file? I just want to make sure the input files needed to run the workflow have been downloaded to your local machine and that the paths to those files are listed in your inputs JSON file.
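
    As a sanity check, the entries in your inputs JSON should point at absolute paths on your machine, along these lines (a sketch with made-up paths and abbreviated key names; check your own file for the exact keys):

    {
      "PreProcessingForVariantDiscovery_GATK4.ref_fasta": "/Users/you/inputs/Homo_sapiens_assembly38.fasta",
      "PreProcessingForVariantDiscovery_GATK4.flowcell_unmapped_bams_list": "/Users/you/inputs/ubam_list.txt"
    }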

    Also, what are the resources of the machine you are running the workflow on? If the disk size of the machine is too small, then the tool may have difficulty writing to disk. If you look through the WDL, the suggested size is indicated by the disk_size variable listed for the task. The SamToFastqAndBwaMem task uses the flowcell_medium_disk variable, which is set to 100G.
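
    For reference, the wiring in the WDL looks roughly like this (a paraphrased sketch of the pattern, not the exact file; variable names follow the workflow):

    Int flowcell_medium_disk = 100   # workflow-level input; 100 GB suggested

    task SamToFastqAndBwaMem {
      # ... command omitted ...
      runtime {
        memory: mem_size
        disks: "local-disk " + disk_size + " HDD"
      }
    }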

    Is there anything written inside the stdout file?

    Also, if you're new to running the Git workflows, the following tutorial may help: How to Execute Workflows from the gatk-workflows Git Organization

  • shlee (Cambridge; Member, Broadie ✭✭✭✭✭)
    edited August 2018

    Hi @Angry_Panda,

    If you have too many concurrent threads/shards each writing to /dev/stdout on your laptop, this can be an issue. The SamToFastq step is writing to /dev/stdout, and based on shard-16/execution/stderr, the off-the-shelf WDL you are using has at least 17 shards. This is a lot for a laptop to handle. I would suggest you process one or two BAMs at a time or move to the cloud. You can monitor memory usage with utility tools like top.
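
    If I recall correctly (please double-check; @bshifaw may know better), Cromwell's local backend has a concurrent-job-limit setting you can drop into a config file to cap how many shards run at once, something like:

    # cromwell.conf (HOCON) -- cap concurrent local jobs; a sketch, adapt to your setup
    include required(classpath("application"))
    backend {
      providers {
        Local {
          config {
            concurrent-job-limit = 2   # run at most 2 shards at a time
          }
        }
      }
    }

    You would then launch Cromwell with java -Dconfig.file=cromwell.conf -jar cromwell-34.jar run ... as before.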

  • Hi @bshifaw, thanks very much for your reply.
    Yes, I already downloaded the input files and adjusted their absolute paths in the JSON file.

    I use my MacBook Pro (13-inch, 2017, two Thunderbolt 3 ports): processor 2.3 GHz Intel Core i5, memory 8 GB 2133 MHz LPDDR3, startup disk Macintosh HD. I checked my storage: 132 GB available of 251 GB.

    There is nothing inside the stdout file.

    Do you think my computer is enough for running this program? If I have to use a cloud service, what I have access to is this: https://research.csc.fi/cpouta.

  • Hi @shlee:
    Thanks very much for your reply. That sounds reasonable. About the concurrent threads, I know nothing. Do you mean I should only give one or two BAM input files in the JSON file, or can I adjust some parameter to control how many threads the WDL runs?

    For now I only have access to this cloud: https://research.csc.fi/cpouta. There are many setups I can choose from. For running the GATK4 Best Practices, e.g. data pre-processing and germline variant calling, what kind of setup is enough?

  • shlee (Cambridge; Member, Broadie ✭✭✭✭✭)

    Hi @Angry_Panda,

    I'm not familiar with the details of the parameters the WDL framework allows you to adjust; @bshifaw can comment more on that. For example, a question to ask is whether there is a way to tell Cromwell to run the shards consecutively instead of concurrently, given your laptop setup.

    I'm not familiar with your cloud setup, and those on the GATK support team can only comment superficially, if at all. However, I do see your cloud service is in Finland, where I believe researchers have access to national compute servers. You should know that you can run Cromwell on servers, e.g. SGE clusters.

    If I were to run this WDL on my laptop, I'd give it only one or two input BAMs. I recommend once you become comfortable with the workflow and are ready to run in production (on large data), then you switch to running on the cloud or a large cluster, where you can process a large number of samples concurrently in the time you expect a single sample to complete.

  • bshifaw (Member, Broadie, Moderator, admin)
    edited August 2018

    Hi @Angry_Panda

    Your question was moved from the docker tutorial to the GATK Forum where user questions are addressed.

    Sounds like the laptop disk space would be enough to perform the task, but the 8 GB memory size may be too small. The suggested memory size for the task is 14 GB, as listed in the inputs JSON file: "PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.mem_size": "14 GB".
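
    If you do want to attempt it on the laptop anyway, that value lives in the inputs JSON and can be lowered to fit your machine (whether the task actually succeeds with less memory is another matter):

    "PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.mem_size": "6 GB"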

    May I again suggest the alternative, which is to run the workflow outside a Docker container as described in this tutorial. This will probably be your best bet in the long run, because processing-for-variant-discovery-gatk4.wdl uses more than one Docker image to complete the workflow. For example, you are currently running the workflow in broadinstitute/genomes-in-the-cloud:2.3.1-1512499786, which is required for SamToFastqAndBwaMem, but by the time it reaches the next task (MergeBamAlignment) the Docker requirement will have changed to broadinstitute/gatk:4*.

    It's also easier not having to spin up your own Docker container to run a workflow. Following these instructions lets you skip editing your WDL file to remove the runtime attribute block for each task: Cromwell is smart enough to ignore runtime parameters it doesn't need to run locally, but still uses the docker attribute to spin up the specified container to complete each task. Thus Cromwell will spin up the required container for each task. Lastly, you wouldn't need to worry about whether you mounted all the files required to run the workflow; again, Cromwell handles that for you.
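
    To make the multi-image point concrete, each task names its own image in its runtime block, roughly like this (a paraphrased sketch, not the exact file; the gatk tag shown is hypothetical):

    task SamToFastqAndBwaMem {
      # ... command omitted ...
      runtime { docker: "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786" }
    }
    task MergeBamAlignment {
      # ... command omitted ...
      runtime { docker: "broadinstitute/gatk:4.0.0.0" }  # hypothetical 4.* tag; differs from the task above
    }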

  • Hi @bshifaw, thanks for your reply. I am now trying to follow the tutorial you offered, specifically the "Running Workflows Locally" part. I was trying to run it with Docker in my Pouta cloud virtual machine, which I can basically use like my laptop: RAM 29.3 GB, 8 vCPUs, disk 900 GiB. I followed it step by step, but I got an error report. I have one question:
    Inside the WDL there is a Docker image, broadinstitute/gatk:latest, but the tutorial didn't mention what I should do besides adjusting the JSON file and then running the WDL with its JSON inputs via Cromwell on the command line. What else should I do, like mounting files? What I did is:
    sudo -i
    docker pull broadinstitute/gatk:latest
    service docker stop
    service docker start
    [[email protected] ~]# java -jar /home/cloud-user/gatk-workflows/cromwell-33.1.jar run /home/cloud-user/gatk-workflows/seq-format-validation/validate-bam.wdl -i /home/cloud-user/gatk-workflows/seq-format-validation/validate-bam.inputs.json

    my error report:
    [error] WorkflowManagerActor Workflow 4b3a59f5-3b0d-4578-8047-bf1fd550ec03 failed (during ExecutingWorkflowState): Job ValidateBamsWf.ValidateBAM:0:1 exited with return code -1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /root/cromwell-executions/ValidateBamsWf/4b3a59f5-3b0d-4578-8047-bf1fd550ec03/call-ValidateBAM/shard-0/execution/stderr.
    Could not retrieve content: /root/cromwell-executions/ValidateBamsWf/4b3a59f5-3b0d-4578-8047-bf1fd550ec03/call-ValidateBAM/shard-0/execution/stderr

    I checked stderr; it is empty. Then I checked stderr.background:
    [[email protected] execution]# more stderr.background
    /bin/bash: /cromwell-executions/ValidateBamsWf/4b3a59f5-3b0d-4578-8047-bf1fd550ec03/call-ValidateBAM/shard-0/execution/script: Permission denied

  • bshifaw (Member, Broadie, Moderator, admin)

    You don't have to mount files; Cromwell will start your Docker container and mount the files for you.

    Looks like you're having a permission problem:
    /cromwell-executions/ValidateBamsWf/4b3a59f5-3b0d-4578-8047-bf1fd550ec03/call-ValidateBAM/shard-0/execution/script: Permission denied

    Cromwell will need root access in your VM to read/write files and directories as mentioned in this comment.
    https://gatkforums.broadinstitute.org/wdl/discussion/comment/51349#Comment_51349
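
    In practice that means launching Cromwell itself as root (or via sudo), e.g. with your earlier command:

    sudo java -jar /home/cloud-user/gatk-workflows/cromwell-33.1.jar run /home/cloud-user/gatk-workflows/seq-format-validation/validate-bam.wdl -i /home/cloud-user/gatk-workflows/seq-format-validation/validate-bam.inputs.json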

  • Angry_Panda (Member)
    edited September 2018

    Hi @bshifaw, thanks for your reply. I did use root privileges to run Cromwell; I ran sudo -i first, before running it. I checked the comment, but it doesn't seem to mention how to solve this.
    I noticed that on my local laptop, docker images shows: broadinstitute/gatk latest,
    while in my Pouta VM, after sudo -i, docker images shows: docker.io/broadinstitute/gatk latest.
    I don't know what the difference is, or whether it causes problems.

  • Hi @shlee, thanks for your reply.

    I am trying to build my VM in the Pouta cloud offered by csc.fi.
    I am having a little trouble choosing the RAM and floating IPs. I checked the Google Cloud guide for running the GATK4 Best Practices; it didn't say how much RAM I need. For now I have chosen 48 vCPUs, 1 TB storage, and 32 GB RAM. It also said "In-use IP Addresses: 51 (minimum 2)", and right now I have 2. I don't know why I would need multiple IP addresses. To run the Best Practices, should I ask for more floating IP addresses, like 10? And how much RAM should I ask for? I did notice that the gatk4-germline-snps-indels JSON file says: "JointGenotyping.SNPsVariantRecalibratorCreateModel.mem_size": "104 GB"

  • bshifaw (Member, Broadie, Moderator, admin)
    edited September 2018

    Hi @Angry_Panda

    Would you mind sharing the contents of the execution folder? For example: /cromwell-executions/ValidateBamsWf/4b3a59f5-3b0d-4578-8047-bf1fd550ec03/call-ValidateBAM/shard-0/execution/*, and also the output of ls -lR on the execution folder.

  • shlee (Cambridge; Member, Broadie ✭✭✭✭✭)

    Hi @Angry_Panda,

    To best help you debug the settings you should be using, would you mind creating a FireCloud account? If you are new, there is a $300 cloud credit that you can get when you sign up. Within FireCloud, you should be able to use a workspace that is pre-configured to run your workflow of interest, including ValidateSamFile, and share that workspace with @bshifaw. The preconfigured workflows show memory settings for the example data that is linked, and you can extrapolate from there. I believe Cromwell also allows interpretation of certain WDL functions that calculate required memory from input file size. Sharing the workspace will help reduce the back-and-forth here and hopefully allow us to pinpoint the issue.

    I myself have gotten the Permission denied error running WDL pipelines on a cloud compute VM. For example, this can happen if you have a mounted volume on the VM across which Cromwell is trying to access files. In this case, you must change some Cromwell configurations to use soft-links instead of hard-links. Solving these types of errors can be frustrating, and we empathize with your situation. Thank you for being patient. We have a separate forum for WDL and Cromwell questions, and spending ten minutes to see whether another researcher has worked out parameters for a similar situation might be fruitful.
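
    For what it's worth, the setting I had to change was the local filesystem's localization strategy in the Cromwell config, roughly like this (a sketch; adapt to your backend and Cromwell version):

    # cromwell.conf -- prefer soft-links over hard-links when localizing input files
    backend {
      providers {
        Local {
          config {
            filesystems {
              local {
                localization: [ "soft-link", "copy" ]
              }
            }
          }
        }
      }
    }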

    I did notice that the gatk4-germline-snps-indels JSON file says: "JointGenotyping.SNPsVariantRecalibratorCreateModel.mem_size": "104 GB"

    Again, these are good rules-of-thumb parameters to start with, especially if you are using the example data we provide. For your own data, you should extrapolate based on your data size.

    I hope @bshifaw is able to solve your case.

  • Dear @shlee and @bshifaw, thanks very much for your patience and help. I have already started to use FireCloud, and I successfully ran the small tutorial WDL that bshifaw offered earlier on both my local computer and my Pouta cloud VM.
    What I did to make the tutorial work:
    I added my current user to the docker user group, and I used chmod +x and chmod +w to change the permissions of the input files.
    Now I will move on to the bigger one and try to run the Best Practices data pre-processing workflow on my cloud VM: 24 vCPUs, 117.2 GB RAM, and 900 GB storage.
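
    For anyone hitting the same Permission denied error, the commands I used were along these lines (the paths here are placeholders, not my exact ones):

    sudo usermod -aG docker $USER          # add the current user to the docker group; log out and back in
    chmod +x /path/to/execution/script     # placeholder path: make the generated script executable
    chmod +w /path/to/inputs/*.bam         # placeholder path: make the input files writable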

  • shlee (Cambridge; Member, Broadie ✭✭✭✭✭)

    Thanks for sharing your solution @Angry_Panda. Glad to hear things are working.
