The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Best Practices for Variant Discovery in DNAseq

edited May 2015

This is our recommended workflow for calling variants in DNAseq data from cohorts of samples, in which steps from data processing up to variant calling are performed per-sample, and subsequent steps are performed jointly on all the individuals in the cohort.

The workflow is divided in three main sections that are meant to be performed sequentially:

• Variant discovery: from reads (BAM files) to variants (VCF files)
• Refinement and evaluation: genotype refinement, functional annotation and callset QC

### Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Local Realignment Around Indels; and Base Quality Score Recalibration (BQSR); performed in that order.

#### Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using BWA mem to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

#### Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads.

#### Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.

### Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

#### Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs. If there are more than a few hundred samples, we combine the gVCFs in batches of ~200 gVCFs using a specialized tool, CombineGVCFs. This will make the next step more tractable and reflects that the processing bottleneck lies with the number of input files and not the number of samples in those files.

#### Joint Genotyping

All available samples are then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs on all of them together to create a set of raw SNP and indel calls. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites.

#### Variant Quality Score Recalibration

Variant recalibration involves using a machine learning method to assign a well-calibrated probability to each variant call in a raw call set. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

### Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Geraldine Van der Auwera, PhD

Post edited by Geraldine_VdAuwera on

• USAMember Posts: 4

Hi,
I don't know if this is the right place to post this comment, but when using Firefox (v 29.0 Linux 64 bits) the "DNAseq" and "RNAseq" buttons don't work properly and I can't get additional information for the different steps.

It works properly using Chromium.

Alberto

Thanks for reporting this bug --I'll try to get that fixed asap.

Geraldine Van der Auwera, PhD

• CaliforniaMember Posts: 8

Just fyi: the bug reported by @albertoap still seems to exist. In case it helps....

@crojo, can you tell me what is your browser/platform version?

Geraldine Van der Auwera, PhD

• CaliforniaMember Posts: 8
edited June 2014

@Geraldine_VdAuwera‌ sure. I'm using Firefox v 29.0.1, on Mac OS X 10.7.5.

• Santiago de CompostelaMember Posts: 5

buttons don't work neither on Firefox v30 beta on Windows 8.1 64 bits, although tabs do work perfectly and all information seems to be accessible through them.

OK, I have a fix for this but can't apply it yet due to other pending modifications. Should be able to get this resolved next week. In the meantime, the content of these pages can be accessed directly through the forum in the GATK Documentation / Best Practices subcategories, for anyone who is completely unable to view these pages. Sorry for the inconvenience and thanks for your patience!

Geraldine Van der Auwera, PhD

• NetherlandsMember Posts: 41

Should be able to get this resolved next week

@Geraldine_VdAuwera: Are the buttons still broken? Or should i update my firefox v20? Maybe both

@mmterpstra‌ Not updated yet, sorry -- ran into a couple of issues that I need to fix before pushing changes to public. Maybe consider using Chrome in the meantime

Geraldine Van der Auwera, PhD

• INSERMMember Posts: 2

It is still broken with Firefox 32.0.3 on Windows 7.
Can you push changes to public?

@geoffroyGATK Sorry, this got pushed to the backburner. I'll try to deal with it this week as we have a couple of other fixes pending.

Geraldine Van der Auwera, PhD

• Member Posts: 4

DNAseq and RNAseq buttons are still broken - at least in Firefox 33.0 under mac OS 10.8. Since the permanent fix may be a ways off yet, here are the links to the two sections for others who prefer to work in Firefox: DNAseq and RNAseq. Hope this helps others while this is being fixed!

Also, Firefox displays the "broken image" symbol on some of the subsections of the RNAseq pages. These don't show up when using Safari (which may simply ignore broken images).

• Los Baños, Laguna, PhilippinesMember Posts: 1

I can confirm that it it still unresolved as of now. Better to just use Chrome.