Introduction to the GATK Best Practices

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited January 9 in Best Practices Workflows

This document provides important context information about how the GATK Best Practices are developed and what are their limitations.


Contents

  1. What are the GATK Best Practices?
  2. Analysis phases
  3. Experimental designs
  4. Workflow scripts provided as reference implementations
  5. Scope and limitations
  6. What is not GATK Best Practices?
  7. Beware legacy scripts

1. What are the GATK Best Practices?

Reads-to-variants workflows used at the Broad Institute.

The GATK Best Practices provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. There are several different GATK Best Practices workflows tailored to particular applications depending on the type of variation of interest and the technology employed. The Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencing machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses. Wherever we can, we try to provide guidance regarding experimental design, quality control (QC) and pipeline implementation options, but please understand that those are dependent on many factors including sequencing technology and the hardware infrastructure that are at your disposal, so you may need to adapt our recommendations to your specific situation.


2. Analysis phases

Although the Best Practices workflows are each tailored to a particular application (type of variation and experimental design), overall they follow similar patterns, typically comprising two or three analysis phases depending on the application.

(1) Data Pre-processing is the first phase in all cases, and involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.

(2) Variant Discovery proceeds from analysis-ready BAM files and produces variant calls. This involves identifying genomic variation in one or more individuals and applying filtering methods appropriate to the experimental design. The output is typically in VCF format although some classes of variants (such as CNVs) are difficult to represent in VCF and may therefore be represented in other structured text-based formats.

(3) Depending on the application, additional steps such as filtering and annotation may be required to produce a callset ready for downstream genetic analysis. This typically involves using resources of known variation, truthsets and other metadata to assess and improve the accuracy of the results as well as attach additional information.


3. Experimental designs

Whole genomes. Exomes. Gene panels. RNAseq

These are the major experimental designs we support explicitly. Some of our workflows are specific to only one experimental design, while others can be adapted to others with some modifications. This is indicated in the workflow documentation where applicable. Note that any workflow tagged as applicable to whole genome sequence (WGS) and others is presented by default in the form that is suitable for whole genomes, and must be modified to apply to the others as recommended in the workflow documentation. Exomes, gene panels and other targeted sequencing experiments generally share the same workflow for a given variant type with only minor modifications.


4. Workflow scripts provided as reference implementations

Less guesswork, more reproducibility.

It's one thing to know what steps should be run (which is what the Best Practices tell you) and quite another to set up a pipeline that does it in practice. To help you cross this important gap, we provide the scripts that we use in our own pipelines as reference implementations. The scripts are written in WDL, a workflow description language designed specifically to be readable and writable by humans without an advanced programming background. WDL scripts can be run on Cromwell, an open-source execution engine that can connect to a variety of different platforms, whether local or cloud-based, through pluggable backends. See the Pipelining Options section for more on the Cromwell + WDL pipelining solution.

We also make all the GATK Best Practices workflows available in ready-to-run form on FireCloud, our cloud-based analysis portal, which you can read more about here.

Note that some of the production scripts we provide are specifically optimized to run on the Google Cloud Platform.
Wherever possible we also provide "generic" versions that are not platform-specific.


5. Scope and limitations

We can't test for every possible use case or technology.

We develop and validate these workflows in collaboration with many investigators within the Broad Institute's network of affiliated institutions. They are deployed at scale in the Broad's production pipelines -- a very large scale indeed. So as a general rule, the command-line arguments and parameters given in the documentation are meant to be broadly applicable (so to speak). However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data, organisms or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values.

In addition, several key steps make use of external resources, including validated databases of known variants. If there are few or no such resources available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution. On the bright side, if you solve them for your field, you will be a hero to a generation of researchers and your citation index will go through the roof.


6. What is not GATK Best Practices?

Lots of workflows that people call GATK Best Practices diverge significantly from our recommendations.

Not that they're necessarily bad. Sometimes it makes perfect sense to diverge from our standard Best Practices in order to address a problem or use case that they're not designed to handle. The canonical Best Practices workflows (as run in production at the Broad) are designed specifically for human genome research and are optimized for the instrumentation (overwhelmingly Illumina) and needs of the Broad Institute sequencing facility. They can be adapted for analysis of non-human organisms of all kinds, including non-diploids, and of different data types, with varying degrees of effort depending on how divergent the use case and data type are. However, any workflow that has been significantly adapted or customized, whether for performance reasons or to fit a use case that we do not explicitly cover, should not be called "GATK Best Practices", which is a term that carries specific meaning. The correct way to refer to such workflows is "based on" or "adapted from" GATK Best Practices. When in doubt about whether a particular customization constitutes a significant divergence, feel free to ask us in the forum.


7. Beware legacy scripts

Trust, but verify.

If someone hands you a script and tells you "this runs the GATK Best Practices", start by asking what version of GATK it uses, when it was written, and what are the key steps that it includes. Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. And if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single parameter in the commands. If anything is unfamiliar to you, you should find out what it does. If you can't find it in the documentation, ask us in the forum. It's one or two hours of your life that can save you days of troubleshooting on the tail end of the pipeline, so please protect yourself by being thorough.

Post edited by Geraldine_VdAuwera on
Tagged:

Comments

  • Hi all,

    I have a problem in the HaplotypeCaller Step for Whole Genome Sequencing Data. after run this step i getting the following error massage (In the attached image).Due to two errors created (in the image), the run step begins and remains in this position for more than two days without having an output? any idea what this might be?

    Code i run:

    java –Xmx64g -jar /home/m.rafiepour222/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar –R /home/m.rafiepour222/GCF_000298355.1_BosGru_v2.0_genomic.fa -T HaplotypeCaller -I /home/m.rafiepour222/SRR3112417/SRR3112417.sort.rmdup.bam -o /home/m.rafiepour222/SRR3112417/SRR3112417.raw.snps.indels.g.vcf

    Best Regard

    Mostafa

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @m_rafiepour222
    Hi Mostafa,

    I am answering here. No need to post twice.

    -Sheila

  • Hi Sheila,

    I'm sorry to repeat Question because it's been for before.

    Best Reagard

    Mostafa

  • DrdeepakDrdeepak indiaMember

    Hi this is Dr. Deepak
    I am using the GATK for non model organism (an insect) for which i have three samples say s1, s2, s3
    I make a database of snps using hard filter a vcf file (produced from the gvcf to vcf file). Now the question is for base recalibration should i use each sample separately that will produce three bsqr table in each iteration or i combine all the three in a single command (by multiple I option) so that it will produce a single bsqr table in each iteration.

  • DrdeepakDrdeepak indiaMember

    Dear GATK members i will highly appreciate you if you will be able to provide the response of my query.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Drdeepak
    Hi Dr. Deepak,

    The important thing is that you have your RGIDs properly identified. BQSR will run per lane. You should input all reads that were sequenced on the same lane to BQSR.

    Sheila

  • robertbrobertb torontoMember

    Is BQSR with GATK4 still recommended best practice? I'm getting warnings about spark not being production ready in bcbio pipeline, wondering what to do about that? thanks - RObert

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @robertb
    Hi Robert,

    BQSR is still part of the Best Practices. Spark tools are still in beta, but the non-spark tools are production ready.

    -Sheila

  • lidialidia Member

    Hi GATK team!

    I am writing to ask for the recommended DNA sequencing depth coverage for the SNP caller to be reliable. Where can I find that information?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @lidia
    Hi,

    We don't have exact numbers, but I can tell you that GATK tools are tested on 30X Illumina data.

    -Sheila

Sign In or Register to comment.