(Howto) Run GATK4 in a Docker container

Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

1. Install Docker

Follow the relevant link below depending on your computer system; on Mac and Windows, select the "Stable channel" download. Run through the installation instructions and initial setup page; they are very straightforward and should only take you a few minutes (not counting download time).
We have included instructions below for all steps after that first page, so you shouldn't need to go to any other pages in the Docker documentation. Frankly their docs are targeted at people who want to do things like run web applications on the cloud and can be quite frustrating to deal with.

Click here for Mac

Click here for Windows

Full list of supported systems and their install pages


2. Get the GATK4 container image

Go to your Terminal (it doesn't matter where your working directory is) and run the following command.

docker pull broadinstitute/gatk:4.beta.6

Note that the last bit after gatk: is the version tag, which you can change to get a different version than what we've specified here.
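
If you expect to switch between versions, it can help to keep the tag in one place. The following is only a sketch: the function names gatk_image and pull_gatk are made up for this example, and it assumes the tag you pass actually exists in the broadinstitute/gatk repository.

```shell
# Compose the image reference; the part after "gatk:" is the version tag.
# With no argument, fall back to the tag used in this tutorial.
gatk_image() {
    printf 'broadinstitute/gatk:%s\n' "${1:-4.beta.6}"
}

# Pull the requested version, then list the GATK images stored locally.
pull_gatk() {
    docker pull "$(gatk_image "$1")" && docker images broadinstitute/gatk
}

# Example (needs a running Docker daemon and network access):
#   pull_gatk 4.beta.6
```

Calling pull_gatk with a different tag fetches that version instead; the listing from docker images shows which versions you already have on disk.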

The GATK4 image is quite large so the download may take a little while if you've never done this before. The good news is that next time you need to pull a GATK4 image (e.g. to get another release), Docker will only pull the components that have been updated, so it will go faster.


3. Start up the GATK4 container

There are several different ways to do this in Docker. Here we're going to use the simplest invocation that gets us the functionality we need, i.e. the ability to log into the container once it's running and execute commands from inside it.

docker run -it broadinstitute/gatk:4.beta.6

If all goes well, this will start up the container in interactive mode, and you will automatically get logged into it. Your terminal prompt will change to something like this:

root@<container-id>:/gatk#

At this point you can use classic shell commands to explore the container and see what's in there, if you like.


4. Run a GATK4 command in the container

The container has the gatk-launch script all set up and ready to go, so you can now run any GATK or Picard command you want. Note that if you want to run a Picard command, you need to use the new syntax, which follows GATK conventions (-I instead of I= and so on). Let's use --list to list all tools available in this version.

./gatk-launch --list

The output will start with a usage message (shown below) then a full list of tools and their summary descriptions.

Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
Running:
    /gatk/build/install/gatk/bin/gatk --help
USAGE:  <program name> [-h]
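
As an aside on the Picard syntax point above: if you have old command lines lying around, the change from I= to -I can be illustrated with a small helper. This is a rough sketch, not part of GATK; the name picard_to_gatk_args is invented here, and it only handles single-letter keys such as I= and O= with values that contain no spaces.

```shell
# Rewrite old-style Picard arguments (KEY=value, single-letter keys only)
# into the GATK-style form (-KEY value). Other arguments pass through as-is.
picard_to_gatk_args() {
    out=""
    for arg in "$@"; do
        case "$arg" in
            [A-Z]=*) out="$out -${arg%%=*} ${arg#*=}" ;;
            *)       out="$out $arg" ;;
        esac
    done
    printf '%s\n' "${out# }"
}

picard_to_gatk_args I=input.bam O=sorted.bam   # prints: -I input.bam -O sorted.bam
```

Check the tool's own help output for the exact argument names before relying on a converted command line.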

Once you've verified that ./gatk-launch --list works for you, you know you can run any GATK4 command you want. But before you proceed, there's one more setup step to go through, which is technically optional but will make your life much easier.


5. Use a mounted volume to access data that lives outside the container

This is the final piece of the puzzle. By default, when you're inside the container you can't access any data that lives on the filesystem outside of the container. One way to deal with that is to copy things back and forth, but that's wasteful and tedious. So we're going to follow the better path, which is to mount a volume in the container, i.e. establish a link that makes part of the filesystem visible from inside the container.

The hitch is that you can't do this once the container is already running, so you'll have to shut it down and run a new one (not just restart the first one) with an extra part added to the command. In case you're wondering why we didn't do this from the get-go, it's because the first command we ran is simpler, so there's less chance that something will go wrong, which is nice when you're trying something for the first time.

To shut down your container, just type exit at the container's prompt:

exit

That should stop the container and take you back to your regular prompt. It's also possible to exit the container without stopping it (a move called detaching) but that's a matter for another time since here we do want to stop it. You'll probably also want to learn how to clean up and delete old instances of containers that you no longer want.
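
On the cleanup point, here is a minimal sketch of one way to do it. The function names are hypothetical; the docker subcommands and filters (ps -a, -q, --filter ancestor=..., --filter status=exited, rm) are standard Docker CLI, and the image tag is assumed to be the one used in this tutorial.

```shell
IMAGE="broadinstitute/gatk:4.beta.6"

# List every container (running or stopped) that was created from the image.
list_gatk_containers() {
    docker ps -a --filter "ancestor=${IMAGE}"
}

# Remove the containers from that image that have exited.
# Running containers are not touched.
remove_stopped_gatk_containers() {
    ids=$(docker ps -a -q --filter "ancestor=${IMAGE}" --filter "status=exited")
    # $ids is left unquoted on purpose so each ID becomes its own argument.
    [ -z "$ids" ] || docker rm $ids
}

# Example:
#   list_gatk_containers
#   remove_stopped_gatk_containers
```

Two related options are worth knowing about: docker container prune removes all stopped containers regardless of image, and adding --rm to docker run makes Docker delete the container automatically when you exit it.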

For now, let's focus on starting a new instance of the GATK4 container. The following command specifies the image to run (the same name and tag as before) and, with the -v option, the filesystem location you want to mount.

docker run -v ~/my_project:/gatk/my_data -it broadinstitute/gatk:4.beta.6

Here I set the external location to be an existing directory called my_project in my home directory (the key requirement is that it has to be an absolute path; the ~ is expanded to one by the shell) and I'm setting the mount point inside the container's /gatk directory. The name of the mount point can be the same as the mounted directory, or something completely different; the main constraint is that it should not conflict with an existing directory in the container, otherwise the existing directory would become inaccessible.
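
If you would rather not think about the absolute path requirement each time, a small wrapper can resolve it for you. This is a sketch under assumptions: run_gatk_with_mount is an invented name, my_data is just the default mount point used in this tutorial, and the image tag is the one pulled earlier.

```shell
# Start an interactive GATK4 container with a host directory mounted under /gatk.
#   $1: host directory (may be relative; it is converted to an absolute path)
#   $2: name of the mount point inside /gatk (default: my_data)
run_gatk_with_mount() {
    # Resolve the directory to an absolute path, and fail early if it is missing.
    host_dir=$(cd "$1" 2>/dev/null && pwd) || {
        echo "error: directory not found: $1" >&2
        return 1
    }
    docker run -v "${host_dir}:/gatk/${2:-my_data}" -it broadinstitute/gatk:4.beta.6
}

# Example:
#   run_gatk_with_mount ~/my_project my_data
```

The wrapper ends up running the same docker run command as above; the only additions are the path resolution and the check that the host directory exists.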

Assuming your paths are valid, the docker run command above starts up the container and logs you into it the same way as before; but now you can see by using ls that you have access to the mounted directory. So now you can run GATK commands on any data you have lying around in it. Have fun!

Comments

  • steve1, Member

    Does this work on IBM Power8 and Power9 systems? Or is a different container needed?

    https://gatkforums.broadinstitute.org/gatk/discussion/4833/speed-up-haplotypecaller-on-ibm-power8-systems

    Issue · GitHub, by Sheila: Issue Number 2994; State: closed; Closed By: vdauwera
  • steve1, Member

    Also is there an equivalent container available for Singularity?

  • Sheila (Broad Institute), Member, Broadie admin

    @steve1
    Hi,
    Let me check with the team and get back to you.
    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @steve1 The GATK4 container should work on any system that supports running Docker. We do not have experience with Power8, but I know the IBM Power team is very keen to provide support for running GATK on their systems; I recommend you reach out to them. Let me know if you do not have a representative to contact; I can dig up the contact info of the team I have interacted with.

    We do not provide Singularity containers at this time, sorry.

  • @Geraldine_VdAuwera
    Hi Geraldine,

    Is it possible to read and write files in a Google Cloud bucket (for example, gs://bucket/folder/f.bam) from within the Docker container, and if so, how? The container is running on a VM on Google Cloud.

    Thank you.

  • bshifaw, Member, Broadie, Moderator admin

    @Bingley, you can download the file from the Google bucket into your container using gsutil cp gs://<path to your file>, then edit it as you normally would with a text editor in the container (vim, emacs, etc.).
    Note: Your Docker image should have gsutil installed.

  • steve1, Member
    edited June 2018

    @Geraldine_VdAuwera IBM can provide pre-built binaries for GATK; however, I have yet to find any way to make a Docker or Singularity container out of them, because no one with IBM servers will give me the user privileges required to build containers. Also, I am not even sure where the IBM GATK binaries are hosted; they do not seem to be publicly available. It would be much easier if someone could provide pre-made containers for them.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @steve1 I would recommend asking IBM to provide their binaries in a container. We are not able to distribute such containers because we do not have access to IBM servers on which to test them.

  • elcortegano (University of Edinburgh), Member
    I'm wondering if it's possible to use this Docker image non-interactively (e.g. without starting it with the -i -t arguments). The objective is to use the GATK Docker image like any other CLI program, without having to open a shell in the container (e.g. for more convenient use in pipelines).

    This can usually be done by mounting a volume and setting an entrypoint. This has worked for me with the Picard image, for instance, but when using the GATK image, as below, I get an error:

    ```
    docker run --rm -v $PWD:/root --entrypoint="java" broadinstitute/gatk -jar gatk.jar HaplotypeCaller -I input.bam -O output.g.vcf -R reference.fa.gz
    ```

    GATK does in fact start, but it says it cannot find the reference genome file:

    ```
    A USER ERROR has occurred: The specified fasta file (file:///gatk/reference_genome/reference.fa.gz) does not exist.
    ```
    This is unlikely to be the real problem, however, because the reference is in fact under $PWD and is mounted fine for other images.

    Any ideas on how to achieve this?

    Thank you