GitHub basics for researchers

shlee, Cambridge (Member, Broadie)
edited February 1 in Dictionary

This guide introduces select elements of the broadinstitute/gatk GitHub repository. It is for researchers on the GATK forum whom we have pointed to the repo, for any of a variety of reasons, and who are unfamiliar with GitHub.

The labels in the screenshot number the seven elements this article covers.

image

Understanding the first three elements (Sections 1–3) should enable researchers to (i) interpret, for example, the status of a feature request or bug fix for a particular GATK release version and (ii) take part in the discussion that drives GATK development forward.

The remaining four elements (Sections 4–7) are of interest to those who wish to read about the mathematics behind GATK algorithms, view versioned WDL-format pipelines for workflows under recent development, learn how to use engine features, e.g. streaming from Google Cloud Storage, and build GATK from the source code.

Jump to a section

  1. Issues: Submit new or discuss existing bugs and feature requests
  2. Pull requests: Make or track changes to the codebase
  3. # releases: Download releases and read release notes
  4. Branch: Control the version of the code in view
  5. docs: Mathematical whitepapers on select algorithms
  6. scripts: Tested versioned WDL pipeline scripts
  7. README.md: Instructions to build and run GATK in the required environment

1. Issues: Submit new or discuss existing bugs and feature requests

Issue tickets are where discussion happens and where plans are set to make changes to the codebase.

image

image

  • When the issue ticket has an Open label, the discussion remains unresolved.

image

  • When the issue ticket has a Closed label, consider the discussion closed. It is still okay to comment on a closed ticket, and closed issues can be reopened.

An issue ticket that discusses plans, or that has a Closed status, does not necessarily mean that GATK has implemented or will implement what is discussed within. Skim the discussion and look for associated pull requests, which are often referred to as PRs, and their status (screenshot below). If you are unclear on any point, ask for clarification by writing a comment in the issue ticket. To do so, you will need a GitHub account and must be signed in.

image

Here's an example issue ticket where the community drove the implementation of a feature, specifically the --include-non-variant-sites option of GenotypeGVCFs: https://github.com/broadinstitute/gatk/issues/2865.


2. Pull requests: Make or track changes to the codebase

Read the discussion in the pull request and any associated issue ticket for specifics on the changes.

image

  • An Open status indicates the changes are ongoing and being worked on away from the master codebase, which is the main code.

image

  • A Merged status means the master code reflects the changes. To reiterate, the master code branch will immediately reflect the changes upon merging a PR. This does not mean the latest GATK release reflects these changes. To figure this out, note the date of the merge. A GATK release that comes after this merge will have the changes. A GATK release before this merge date will not contain the changes.

image

  • An associated issue ticket appears like so and clicking on the link will open it.

Here's an example pull request that pairs with the previous example issue ticket: https://github.com/broadinstitute/gatk/pull/5219.
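
The date comparison described above for Merged pull requests can be sketched in a few lines of Python. This is only an illustration: the release names and all dates below are made up, and the function name `releases_containing` is hypothetical. A release dated the same day as the merge is treated here as not containing the change; in such a case, check the release notes.

```python
from datetime import date


def releases_containing(merge_date, releases):
    """Return the versions released strictly after merge_date, oldest first.

    `releases` maps a version label to its release date.
    """
    later = [
        (released, version)
        for version, released in releases.items()
        if released > merge_date
    ]
    return [version for released, version in sorted(later)]


# Hypothetical release labels and dates, for illustration only.
releases = {
    "release-A": date(2020, 1, 10),
    "release-B": date(2020, 3, 5),
    "release-C": date(2020, 6, 20),
}

# Hypothetical merge date of a pull request.
merge_date = date(2020, 2, 14)

print(releases_containing(merge_date, releases))
```

With these made-up dates, only the releases dated after the merge are listed, and the first of them is the earliest release that includes the change.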


3. # releases: Download releases and read release notes

In the overview screenshot we see 35 releases for GATK4. The releases page presents releases in reverse-chronological order, so the latest release is at the top.

image

  • The release date is immediately underneath the release version tag.

image

  • Click the gatk-4.x.x.x.zip link under Assets to download the pre-built release. When you expand the zip bundle, you will get a folder named gatk-4.x.x.x containing a working launch script that you use to invoke tools from the command line. Typing /path/to/gatk-4.x.x.x/gatk --list into a terminal prompt will list the available tools in the toolkit as well as their production status: experimental (labeled EXPERIMENTAL Tool), in beta testing (labeled BETA Tool), or fit for production (no label).

image

  • Each release comes with release notes. Release notes are the definitive place to learn about changes in GATK. Our engineers curate the notes to be meaningful and human-readable and derive them from git commit messages, a source of more technical detail that this article does not cover. Often, a bullet point in the release notes will have a link to the relevant pull request. If you need clarification on some point, please ask in the associated issue or on the GATK forum.
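
As a rough illustration of the three production statuses that the tool listing distinguishes, the Python sketch below classifies a line of text by the label it carries. The exact layout of the gatk --list output is not reproduced in this article, so the sample lines and tool names here are invented; only the label strings EXPERIMENTAL Tool and BETA Tool come from the description above.

```python
def production_status(listing_line):
    """Classify one tool-listing line by the status label it carries."""
    if "EXPERIMENTAL Tool" in listing_line:
        return "experimental"
    if "BETA Tool" in listing_line:
        return "beta"
    # No label means the tool is considered fit for production.
    return "production"


# Invented sample lines; real gatk --list output may be laid out differently.
sample_lines = [
    "HypotheticalToolA    (EXPERIMENTAL Tool) Does something new",
    "HypotheticalToolB    (BETA Tool) Does something still in testing",
    "HypotheticalToolC    Does something established",
]

# Map each (made-up) tool name to its status.
statuses = {line.split()[0]: production_status(line) for line in sample_lines}

for tool, status in statuses.items():
    print(tool, status)
```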

4. Branch: Control the version of the code in view

image

The branch is set to master by default, which reflects the latest development to the broadinstitute/gatk codebase. To view a snapshot of the code for a particular version of GATK, click the Branch button, then switch to the Tags tab. Selecting a tag version, e.g. 4.0.0.0, will allow you to travel back in time to the codebase as it looked for that particular release. This is useful, e.g. if you are looking for WDL pipeline scripts that work for past versions of GATK4 (see Section 6).


5. docs: Mathematical whitepapers on select algorithms

The PDFs within this folder and subfolders outline the mathematics behind select GATK algorithms. If the GATK forum seems sparse on mathematical details, that is because it is not set up to display complex LaTeX equations. The whitepapers are provided by the generosity of GATK methods developers. Be sure to take into consideration the datestamps associated with the articles, as development takes priority over documentation and the mathematical details can fall behind the latest algorithmic improvements.


6. scripts: Tested versioned WDL pipeline scripts

For certain GATK4 workflows, the developers maintain working WDL pipeline scripts for every release. See Section 4 for instructions on accessing tagged versioned scripts.

image

Take for example the mutect2_wdl directory. It contains pipeline scripts for creating a Mutect2 PoN, for running Mutect2 on a tumor-normal pair, etc. The view will show the development or master codebase by default. The following portions of the highlighted script illustrate a difference between the v4.0.0.0 and the v4.1.0.0 WDLs in how each workflow invokes its respective M2 task.

Notice the URL elements that differ: the tag version and the highlighted lines. The later pipeline defines a number of additional parameters, e.g. artifact_prior_table, that are not present in the earlier pipeline. The details of the respective M2 tasks also differ. So if you are testing out workflows using broadinstitute/gatk repository WDL scripts, be sure to match the script version to the version of the toolkit you run.
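
To make the role of the tag in the URL concrete, here is a small Python sketch that builds the GitHub web address of a folder as it looked at a given tag. It relies on GitHub's general https://github.com/OWNER/REPO/tree/REF/PATH address pattern. The function name `tagged_tree_url` is hypothetical, and the folder path scripts/mutect2_wdl is inferred from the folder names mentioned in this section, so confirm it in the repository before relying on it.

```python
def tagged_tree_url(tag, path, repo="broadinstitute/gatk"):
    """Build the GitHub URL for browsing `path` as it looked at `tag`."""
    return f"https://github.com/{repo}/tree/{tag}/{path.strip('/')}"


# The two tag versions compared in this section.
for tag in ("4.0.0.0", "4.1.0.0"):
    # Assumed folder path, inferred from the text above.
    print(tagged_tree_url(tag, "scripts/mutect2_wdl"))
```

The two printed addresses differ only in the tag segment, which is the part to match to the GATK version you run.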


7. README.md: Instructions to build and run GATK in the required environment

The README.md is a document that the repository landing page displays below the list of folders and files. For the broadinstitute/gatk repository, it presents a wealth of information, organized by a Table of Contents at the top.

Of interest to researchers are the following sections.

