Instructions for GATK workshops
Preparing for the workshop
To follow these instructions and attend the workshop, you will need a basic understanding of the following terms and command-line operations. If any of them are unfamiliar, consult a more experienced colleague or your system administrator, if you have one. There are also many good online tutorials you can use to learn the necessary concepts.
- Basic Unix environment commands
- Binary / Executable
- Adding a binary to your path (optional)
- Command-line shell, terminal or console
- Software library
Contents
1. Todo list for the impatient
2. Platform requirements (hardware and environment)
3. Software packages to install
4. Workshop materials (data, worksheets and slides)
1. Todo list for the impatient
- Set up GATK4 Docker as described here
- Download and install the additional tools listed in section 3
- Download the workshop materials described in section 4
2. Platform requirements
See the Quickstart guide for general GATK software requirements.
Important note about MS Windows: We try to support participants running on Windows systems, and we find that most of the workshop exercises run well using Docker on Windows. However, we often encounter technical issues that are specific to Windows, and some of these issues currently have no solution. For that reason, we cannot guarantee full success with Windows, and we encourage you to make arrangements to use a Linux system for the workshop.
The analyses we run in workshops are designed to run quickly on small datasets, so they can run on single-processor machines and should not require more than 4 GB of RAM. For file storage, plan on at least 10 GB of free disk space.
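The storage requirement above can be checked from a terminal before the workshop. The following is a minimal sketch, assuming a POSIX shell with `df` and `awk` available; the function names and the `WORKSHOP_DIR` variable are hypothetical and not part of the workshop materials.

```shell
# Print the free space, in whole GB, on the filesystem that holds a directory.
free_space_gb() {
    # df -P gives one POSIX-format line per filesystem and -k reports
    # 1024-byte blocks; column 4 of the second line is the available space.
    df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed when the directory ($1) has at least $2 GB free.
has_space() {
    [ "$(free_space_gb "$1")" -ge "$2" ]
}

# WORKSHOP_DIR is a hypothetical variable; it defaults to the current directory.
WORKSHOP_DIR="${WORKSHOP_DIR:-.}"
if has_space "$WORKSHOP_DIR" 10; then
    echo "OK: at least 10 GB free in $WORKSHOP_DIR"
else
    echo "WARNING: less than 10 GB free in $WORKSHOP_DIR"
fi
```

Run it from the directory where you plan to unpack the workshop bundle, since free space is reported per filesystem.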
We use Docker to ensure that all workshop participants are working with the same environment. This greatly reduces time wasted dealing with environment differences or dependency-related issues. Participants who choose to work with a different setup will be responsible for adapting instructions accordingly.
Be sure to install and configure the Docker environment correctly before the workshop by following this procedure, including pulling the GATK4 Docker image. The image is very large and may take a long time to download, so this must be done in advance.
Running on remote servers is not recommended, as we will use desktop software such as IGV. Participants who choose to run on a remote server will be responsible for setting up network mounts or transferring files so that they can work with desktop software.
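As a rough illustration of the pull step, the sketch below wraps `docker pull` in a small helper that first confirms Docker is on your path. The image name `broadinstitute/gatk` is an assumption about where the image is published, and the helper names are hypothetical; the linked procedure remains the authoritative source for the exact image and tag to use.

```shell
# Assumption: the image name on Docker Hub; use the one named in the setup procedure.
GATK_IMAGE="${GATK_IMAGE:-broadinstitute/gatk}"

# Succeed when a command is available on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Download the image, then list the local copy to confirm the download completed.
pull_gatk_image() {
    if ! have_command docker; then
        echo "docker was not found on your PATH; install Docker first" >&2
        return 1
    fi
    docker pull "$GATK_IMAGE" || return 1
    docker images "$GATK_IMAGE"
}

# Run this ahead of the workshop, on a fast connection:
#   pull_gatk_image
```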
3. Software packages to install
- Genome Analysis Toolkit (GATK) and Picard
- IGV genome browser
- RStudio IDE and R libraries ggplot2 and gsalib
- Cromwell and WomTool
Genome Analysis Toolkit (GATK) and Picard
Hopefully, if you're reading this, you're already acquainted with the purpose of the GATK. As described in more detail in the Quickstart guide, you can either download the GATK package and run it directly in the "traditional" way, or you can run it from within a Docker container. In our workshops, we use Docker, so you will need to follow this procedure to install Docker and get the GATK container image installed appropriately. This may seem a bit more complicated up front but it eliminates the majority of problems we see people struggle with.
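To give a sense of what running GATK from within a container looks like, here is a hedged sketch that mounts a host directory into the container and invokes the `gatk` launcher. The image name and the `/gatk/my_data` mount point are assumptions, and `gatk_in_docker` and `DRY_RUN` are hypothetical names introduced for this example; follow the linked procedure for the setup actually used in the workshop.

```shell
# Assumption: the image name on Docker Hub.
GATK_IMAGE="${GATK_IMAGE:-broadinstitute/gatk}"

# Run GATK in a throwaway container with a host directory ($1) mounted at
# /gatk/my_data (a hypothetical mount point). Remaining arguments go to gatk.
gatk_in_docker() {
    data_dir="$1"
    shift
    set -- docker run --rm -v "$data_dir:/gatk/my_data" "$GATK_IMAGE" gatk "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        # Dry run: print the command instead of executing it.
        echo "$*"
    else
        "$@"
    fi
}

# With DRY_RUN set, the command is only printed; unset it to run for real.
DRY_RUN=1
gatk_in_docker "$PWD" --list
```

Mounting a directory this way is what lets desktop tools such as IGV open the files that GATK writes inside the container.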
IGV genome browser
The Integrative Genomics Viewer (IGV) is a genome browser that allows you to view BAM, VCF and other genomic file information in context. It has a graphical user interface that is very easy to use and can be downloaded for free (though registration is required) from this website. We encourage you to read through IGV's very helpful user guide, which includes many detailed tutorials that will help you use the program most effectively.
RStudio IDE and R libraries ggplot2 and gsalib and dependencies
Download the latest version of RStudio IDE. The webpage should automatically detect what platform you are running on and recommend the version most suitable for your system.
Follow the installation instructions provided. Binaries are provided for all major platforms; typically they just need to be placed in your Applications (or Programs) directory. Open RStudio and type the following command in the console window to install the required packages:
install.packages(c("ggplot2", "reshape", "gplots", "gridExtra", "gsalib"))
This will download and install these packages so you can use them in the workshop. You can also do this directly in R if you prefer.
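If you would like to confirm from a terminal that the packages installed correctly, something along these lines may help. It is only a sketch: it assumes `Rscript` is on your path, and `check_r_packages` is a hypothetical helper name.

```shell
# Package names taken from the install.packages() command above.
R_PACKAGES="ggplot2 reshape gplots gridExtra gsalib"

# Try to load each package; return 0 if all load, 1 if any is missing,
# and 2 if Rscript itself cannot be found.
check_r_packages() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript was not found on your PATH" >&2
        return 2
    fi
    missing=0
    for pkg in $R_PACKAGES; do
        # library() raises an error, and Rscript exits non-zero, if a package is absent.
        if Rscript -e "suppressPackageStartupMessages(library($pkg))" >/dev/null 2>&1; then
            echo "ok       $pkg"
        else
            echo "MISSING  $pkg"
            missing=1
        fi
    done
    return "$missing"
}

check_r_packages || echo "Some packages still need attention."
```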
Cromwell and WomTool
For the pipelining section of the workshop, you will need to get the jar files for the Cromwell execution engine and a utility called WomTool. We also recommend a text editor called SublimeText.
Cromwell is an execution engine that runs scripts written in WDL (Workflow Description Language), which describe data processing and analysis workflows. The latest release can be downloaded here in the form of a pre-compiled jar.
WomTool is a utility package that provides accessory functionality for writing and running WDL scripts, including syntax validation and input template generation. You can download the latest release of the pre-compiled jar here.
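To make the roles of the two jars concrete, the sketch below writes a tiny workflow, validates it and generates an input template with WomTool, then runs it with Cromwell. The jar paths, the file names and the `HelloWorkshop` workflow are hypothetical and are not part of the workshop bundle; downloaded jars normally carry a version number in their file name, and the example assumes a release that accepts WDL `version 1.0`.

```shell
# Hypothetical jar locations; point these at the files you downloaded.
WOMTOOL_JAR="${WOMTOOL_JAR:-womtool.jar}"
CROMWELL_JAR="${CROMWELL_JAR:-cromwell.jar}"

# Write a minimal one-task workflow to experiment with.
cat > hello.wdl <<'EOF'
version 1.0

workflow HelloWorkshop {
  input {
    String name
  }
  call SayHello { input: name = name }
}

task SayHello {
  input {
    String name
  }
  command <<<
    echo "Hello, ~{name}"
  >>>
  output {
    String greeting = read_string(stdout())
  }
}
EOF

if command -v java >/dev/null 2>&1 && [ -f "$WOMTOOL_JAR" ] && [ -f "$CROMWELL_JAR" ]; then
    # Syntax validation.
    java -jar "$WOMTOOL_JAR" validate hello.wdl
    # Input template generation; edit the placeholder values for real use.
    java -jar "$WOMTOOL_JAR" inputs hello.wdl > hello.inputs.json
    # Execute the workflow with the supplied inputs.
    java -jar "$CROMWELL_JAR" run hello.wdl --inputs hello.inputs.json
else
    echo "java or the jar files were not found; wrote hello.wdl only"
fi
```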
WDL can be written with any text editor, but for this workshop we will be using SublimeText. It is a simple but effective program, and you can download it here. It also supports syntax highlighting for WDL, which you can install by following the instructions here.
4. Workshop materials (data, worksheets and slides)
We provide a bundle containing test datasets, worksheets with the instructions for the hands-on exercises, and all slide decks presented in the workshop. You can find GATK workshop bundles organized by YYMM (year-month) in the GATK Workshops directory. If you are registered for an upcoming workshop where you will be using your own laptop, you MUST download the bundle before coming to the workshop. If we update the bundle ahead of the workshop, you will receive a notification with a reminder to download the new version.
For those attending pipelining-only workshops, the workshop bundle will differ. Please check your email for where to find these materials.