
Somatic short variant discovery (SNVs + Indels)

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

Purpose

Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.



Reference Implementations

Pipeline | Summary | Notes | Github | Terra
Somatic short variants tumor-normal pair | T-N BAMs to VCF | universal | yes | b37
Somatic short variants PON creation | Normal BAMs to PON | universal | yes | b37

Expected input

This workflow requires BAM files for each input tumor and normal sample. Input BAMs should be pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps

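This workflow has two main stages. First we generate a large set of candidate somatic variants, then we filter them to obtain a more confident set of somatic variant calls. The sections below describe these stages along with the supporting steps (contamination estimation, orientation bias modeling, and annotation).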

Call candidate variants

Tools involved: Mutect2

Like HaplotypeCaller, Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.
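As a rough illustration of how this step might be invoked, the Python sketch below assembles a Mutect2 command line for a tumor sample with an optional matched normal. The helper name `mutect2_command` and all file names are hypothetical, and the flags are the ones documented for recent GATK4 releases as far as we are aware; check them against the tool documentation for your installed version before running anything.

```python
from typing import List, Optional


def mutect2_command(
    reference: str,
    tumor_bam: str,
    output_vcf: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
    germline_resource: Optional[str] = None,
    panel_of_normals: Optional[str] = None,
    f1r2_tar_gz: Optional[str] = None,
) -> List[str]:
    """Build an argument list for one Mutect2 run (hypothetical helper)."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    if normal_bam is not None:
        # The matched normal is identified by its sample name, not its file name.
        if normal_sample is None:
            raise ValueError("normal_sample is required when normal_bam is given")
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    if germline_resource is not None:
        cmd += ["--germline-resource", germline_resource]
    if panel_of_normals is not None:
        cmd += ["--panel-of-normals", panel_of_normals]
    if f1r2_tar_gz is not None:
        # Optional F1R2 counts, used later to learn orientation bias artifacts.
        cmd += ["--f1r2-tar-gz", f1r2_tar_gz]
    cmd += ["-O", output_vcf]
    return cmd


if __name__ == "__main__":
    # Placeholder file and sample names, for illustration only.
    print(" ".join(mutect2_command(
        reference="ref.fasta",
        tumor_bam="tumor.bam",
        output_vcf="unfiltered.vcf.gz",
        normal_bam="normal.bam",
        normal_sample="NORMAL_SAMPLE",
        f1r2_tar_gz="f1r2.tar.gz",
    )))
```

Building the command as a list keeps tumor-only and tumor-normal runs in one code path and makes it easy to hand the result to `subprocess.run`.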

Calculate Contamination

Tools involved: GetPileupSummaries, CalculateContamination

This step emits an estimate of the fraction of reads due to cross-sample contamination for each tumor sample and an estimate of the allelic copy number segmentation of each tumor sample. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal even in samples with significant copy number variation and makes no assumptions about the number of contaminating samples.
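The two tools are typically chained: GetPileupSummaries tabulates read counts at common variant sites, and CalculateContamination turns those tables into a contamination estimate and a segmentation table. The sketch below shows one way to assemble the commands in Python, with or without a matched normal. The helper name, the table file names, and the choice to reuse the common-sites VCF as the interval list are assumptions for illustration; verify the flags against your GATK version.

```python
from typing import List, Optional


def contamination_commands(
    tumor_bam: str,
    common_sites_vcf: str,
    normal_bam: Optional[str] = None,
) -> List[List[str]]:
    """Return the GATK commands for the contamination step (hypothetical helper)."""

    def pileups(bam: str, table: str) -> List[str]:
        # Summarize read support at common germline sites for one BAM.
        return [
            "gatk", "GetPileupSummaries",
            "-I", bam,
            "-V", common_sites_vcf,
            "-L", common_sites_vcf,
            "-O", table,
        ]

    commands = [pileups(tumor_bam, "tumor_pileups.table")]
    calc = ["gatk", "CalculateContamination", "-I", "tumor_pileups.table"]
    if normal_bam is not None:
        # A matched normal is optional; the tool is designed to work without one.
        commands.append(pileups(normal_bam, "normal_pileups.table"))
        calc += ["--matched-normal", "normal_pileups.table"]
    calc += ["--tumor-segmentation", "segments.table", "-O", "contamination.table"]
    commands.append(calc)
    return commands


if __name__ == "__main__":
    for command in contamination_commands("tumor.bam", "common_sites.vcf.gz"):
        print(" ".join(command))
```

The resulting `contamination.table` and `segments.table` are the files that the filtering step would consume.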

Learn Orientation Bias Artifacts

Tools involved: LearnReadOrientationModel

This tool uses the optional F1R2 counts output of Mutect2 to learn the parameters of a model for orientation bias. For each trinucleotide context, it estimates the prior probability of a single-stranded substitution error that arose before sequencing. This step is especially important for FFPE tumor samples.
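To build intuition for what an orientation bias artifact looks like, the toy Python sketch below asks whether the alt-supporting reads at a site split evenly between the two read-pair orientations (F1R2 and F2R1). It uses a plain exact binomial test, which is a stand-in chosen for illustration; it is not the model that LearnReadOrientationModel fits, and the function name and counts are hypothetical.

```python
from math import comb


def orientation_imbalance_pvalue(f1r2_alt: int, f2r1_alt: int) -> float:
    """Two-sided exact binomial p-value for a 50/50 split of alt reads.

    A small p-value means the alt reads are concentrated in one orientation,
    which is the signature of a single-stranded artifact.
    """
    n = f1r2_alt + f2r1_alt
    if n == 0:
        return 1.0
    probs = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[f1r2_alt]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    print(orientation_imbalance_pvalue(5, 5))   # balanced support
    print(orientation_imbalance_pvalue(10, 0))  # all alt reads in one orientation
```

A real variant should be supported by both orientations, so a strongly one-sided split is a reason for suspicion rather than proof of an artifact.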

Filter Variants

Tools involved: FilterMutectCalls

Mutect2’s somatic likelihoods model assumes that read errors are independent, so that, for example, four reads each with an error probability of 1/1000 yield odds of roughly 1000^4 (a log10 odds of about 12) in favor of being a real variant versus a sequencing error. FilterMutectCalls accounts for correlated errors, that is, the possibility that all variant reads at a site were due to some common source of error. It accomplishes this through several hard filters to detect alignment artifacts and probabilistic models for strand and orientation bias artifacts, polymerase slippage artifacts, germline variants, and contamination. Additionally, it learns a Bayesian model for the overall SNV and indel mutation rate and allele fraction spectrum of the tumor to refine the log odds emitted by Mutect2. It then automatically sets a filtering threshold to optimize the F score, the harmonic mean of sensitivity and precision.
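The arithmetic behind the independence assumption, and the idea of choosing a threshold that maximizes the F score, can be sketched in a few lines of Python. The threshold search below is an assumption about how such an optimization could work: it takes each call's estimated probability of being an error and computes expected sensitivity and precision from those probabilities. It is not taken from the FilterMutectCalls source, and the function names and example values are hypothetical.

```python
import math
from typing import List, Tuple


def independent_error_log10_odds(error_probs: List[float]) -> float:
    """log10 odds of a real variant if every read's error were independent."""
    return -sum(math.log10(p) for p in error_probs)


def f_score(sensitivity: float, precision: float) -> float:
    """Harmonic mean of sensitivity and precision."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


def pick_threshold(error_probs: List[float]) -> Tuple[float, float]:
    """Return (threshold, expected F score) maximizing the expected F score.

    Calls with an error probability at or below the threshold are kept.
    """
    expected_true = sum(1 - p for p in error_probs)
    best_threshold, best_f = 0.0, 0.0
    tp = fp = 0.0
    for p in sorted(error_probs):
        tp += 1 - p  # expected true positives among the kept calls
        fp += p      # expected false positives among the kept calls
        sensitivity = tp / expected_true if expected_true > 0 else 0.0
        precision = tp / (tp + fp)
        score = f_score(sensitivity, precision)
        if score > best_f:
            best_threshold, best_f = p, score
    return best_threshold, best_f


if __name__ == "__main__":
    print(independent_error_log10_odds([1e-3] * 4))  # about 12
    print(pick_threshold([0.01, 0.02, 0.05, 0.9, 0.95]))
```

In the second example the three low-error calls are kept and the two high-error calls are rejected, because admitting either of the latter costs more precision than it gains in sensitivity.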

Annotate Variants

Tools involved: Funcotator

At this step we run tools that add information to the discovered variants in our dataset. One of those tools, Funcotator, can be used to add gene-level information to each variant. Funcotator is a functional annotation tool in the core GATK toolset and was designed to handle both somatic and germline use cases. It reads in a VCF file, labels each variant with one of twenty-three distinct variant classifications, reports gene information (e.g. affected gene, predicted variant amino acid sequence, etc.), and adds associations to information in datasources. Supported datasources include GENCODE (gene information and protein change prediction), dbSNP, gnomAD, and COSMIC (among others). The corpus of datasources is extensible and user-configurable and includes cloud-based datasources supported with Google Cloud Storage. Funcotator produces either a Variant Call Format (VCF) file (with annotations in the INFO field) or a Mutation Annotation Format (MAF) file.
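A minimal sketch of how a Funcotator run might be assembled is shown below, with the output format restricted to the two types named above. The helper name, file names, and reference version string are placeholders, and the flags reflect the GATK4 Funcotator documentation as we understand it; confirm them for your installed version.

```python
from typing import List


def funcotator_command(
    variant_vcf: str,
    reference: str,
    ref_version: str,
    data_sources_path: str,
    output: str,
    output_format: str = "VCF",
) -> List[str]:
    """Build an argument list for one Funcotator run (hypothetical helper)."""
    if output_format not in ("VCF", "MAF"):
        raise ValueError("output_format must be 'VCF' or 'MAF'")
    return [
        "gatk", "Funcotator",
        "--variant", variant_vcf,
        "--reference", reference,
        "--ref-version", ref_version,
        "--data-sources-path", data_sources_path,
        "--output", output,
        "--output-file-format", output_format,
    ]


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    print(" ".join(funcotator_command(
        variant_vcf="filtered.vcf.gz",
        reference="ref.fasta",
        ref_version="hg19",
        data_sources_path="funcotator_dataSources",
        output="annotated.maf",
        output_format="MAF",
    )))
```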

Additional Information


Post edited by akovalsk on
