# Best Of

### Re: LiftoverVCF chain file for b37 to hg38

@SChaluvadi Can we add this to the resource bundle?

### Re: gatk4 error: java.lang.IllegalStateException: The covariates table is missing ReadGroup

"...thank you and GATK team, I've already solved my problem now."

Hi there, @xingaulag! For the benefit of others who may have encountered a similar issue, would you mind explaining how you got it to work?

### Assigning per-sample genotypes (HaplotypeCaller)

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context on how this fits into the overall HaplotypeCaller method, please see the general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as the one on PL and GQ.

This procedure is NOT applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.

#### Contents

- Overview
- Preliminary assumptions / limitations
- Calculating genotype likelihoods using Bayes' Theorem
- Selecting a genotype and emitting the call record

## 1. Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains is to evaluate those likelihoods in aggregate to determine the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype and selecting the most likely one. This produces a genotype call, along with various metrics that will be annotated in the output VCF if a variant call is emitted.

Note that this describes the **regular mode** of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in GVCF modes (`-ERC GVCF` and `-ERC BP_RESOLUTION`), please see the reference confidence model documentation.

## 2. Preliminary assumptions / limitations

### Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

### Ploidy

Both HaplotypeCaller and GenotypeGVCFs assume that the organism of study is diploid by default, but the desired ploidy can be set using the `-ploidy` argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation using a generalized form of the genotyping algorithm that can handle ploidies other than 2. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered. There are some arguments that aim to mitigate those limitations, but they are not fully documented yet.

### Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

### Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.

## 3. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype **G** given the observed data **D**.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype **G**:

$$ P(G) $$

represents how likely we are to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value), but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype **G** is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod_{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype **G** into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype **G**.
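To make the formula above concrete, here is a minimal Python sketch of the diploid $P(D|G)$ calculation with a flat prior. The per-read allele likelihoods are made-up numbers standing in for the PairHMM output; this illustrates the math only, not GATK's actual implementation.

```python
def genotype_likelihood(read_liks, h1, h2):
    """P(D|G) for diploid genotype G = H1H2: the product over reads j of
    P(D_j|H1)/2 + P(D_j|H2)/2, as in the formula above."""
    p = 1.0
    for liks in read_liks:  # liks maps allele -> P(D_j | allele)
        p *= liks[h1] / 2 + liks[h2] / 2
    return p

# Five reads over alleles A (ref) and T (alt); values are illustrative.
reads = [
    {"A": 0.99, "T": 0.01},
    {"A": 0.98, "T": 0.02},
    {"A": 0.05, "T": 0.95},
    {"A": 0.02, "T": 0.97},
    {"A": 0.90, "T": 0.10},
]

# With a flat prior, P(G|D) is proportional to P(D|G) for each genotype.
for g in [("A", "A"), ("A", "T"), ("T", "T")]:
    print(g, genotype_likelihood(reads, *g))
```

With this roughly half-and-half mix of A-supporting and T-supporting reads, the heterozygous genotype AT comes out as the most likely, as expected.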

## 4. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.
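As a sketch of this enumeration step, the following Python snippet lists every unordered diploid genotype for the alleles observed at a site and picks the one with the highest likelihood. The log10 likelihood values are invented for illustration and would in practice come from the calculation in section 3.

```python
import itertools

alleles = ["A", "T"]  # alleles observed at the site

# Every unordered diploid genotype: AA, AT, TT.
genotypes = list(itertools.combinations_with_replacement(alleles, 2))

# Illustrative log10 likelihoods for each genotype (not real data).
log10_liks = {("A", "A"): -6.1, ("A", "T"): -1.5, ("T", "T"): -4.8}

# The call is simply the genotype with the largest likelihood.
call = max(genotypes, key=lambda g: log10_liks[g])
print("called genotype:", call)
```

For a multi-allelic site, adding a third allele to `alleles` would yield six candidate genotypes, which is where the combinatorial growth mentioned in the ploidy section comes from.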

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.

### Re: Missing PS field in the VCF file produced by GenotypeGVCFs

I'm sorry @svitlana, but it looks like you asked your question at an inopportune time! We have just moved to our new site (accessible at gatk.broadinstitute.org) and have shut down posting here. That means you will not find the answer to your question here!

In the meantime, I've duplicated your question to the new forum at this url, where you can add your own comments and follow the thread for updates. Please be sure to register for a new forum account (this new account will extend to all the software we support, including Terra and Cromwell/WDL). We apologize for any inconvenience!

You can find more information about the move on our blog post.

### Re: SAC annotation

Thank you for your input and contribution @johnma!

### Re: SAC annotation

Maybe it's a bit too late, but what I'd do is to:

1. Use HaplotypeCaller to do forced genotyping, using the `--alleles` argument, and then
2. Use `bcftools annotate -a HC.vcf.gz -c FMT/SAC old.vcf` to copy the SAC field back to the old callset.

### Re: Filter samples of bad quality before running GermlineCNVCaller

It may be better to check the mean and median coverage of samples before generating a cohort. The greater the variability between samples, the more false positives and negatives you will get.
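As an illustration of this kind of pre-cohort check, here is a small Python sketch that flags samples whose mean coverage deviates strongly from the cohort median. The sample names, coverage values, and the 50% deviation threshold are all hypothetical, not a GATK recommendation.

```python
from statistics import median

# Made-up mean coverage per sample, e.g. from a coverage-collection step.
coverage = {"S1": 31.2, "S2": 29.8, "S3": 30.5, "S4": 12.1, "S5": 30.9}

cohort_median = median(coverage.values())

# Flag any sample more than 50% away from the cohort median (illustrative cutoff).
outliers = [s for s, c in coverage.items()
            if abs(c - cohort_median) > 0.5 * cohort_median]

print("cohort median coverage:", cohort_median)
print("flagged samples:", outliers)
```

Here S4, at well under half the cohort median, would be excluded before running GermlineCNVCaller on the remaining samples.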

### Re: HaplotypeCallerSpark throws error Unable to find class: htsjdk.samtools.reference.AbstractFastaSeque

Thank you! Merry Christmas and New Year to you as well. I hope you get some time to relax without any spark bugs interrupting you!

### Re: Mutect2 repeatedly not detecting somatic variant IDH2 R172K, with solid read support and 5% AF

@mack812 What's going on under the hood is as follows: when M2 initially tries to assemble with a kmer size of 10, there happen to be a few stray reads with an error that induces a cycle in the graph. That is, there is, say, a kmer AAACCCTTTG toward the end of the amplicon and a kmer AAAGCCTTTG toward the beginning. A single base error in the former can induce a path at the end of the amplicon to jump to the beginning.

When the assembly engine finds a cycle, it gives up and tries a larger kmer, by default 25. This is fine (there is a 10-base pseudo-homology but not a 25-base one), but because the variant occurs less than 25 bases from the start of the amplicon it ends up on a dangling end of the graph. We have some code to handle this but it's a bit sloppy and incomplete, to be honest, so the variant is missed.

We are working on improvements to the assembly engine that will let it handle cycles in the graph (using the linked de Bruijn graph structure that our colleague Kiran Garimella co-authored: https://academic.oup.com/bioinformatics/article/34/15/2556/4938484), and are also improving dangling end recovery. Until these efforts mature, however, mitochondria mode is probably the best and most principled fix.

Since you want more details, I'll first describe a couple of hackier solutions. One was to hope that a 14-mer would not induce cycles in the graph. The variant is 15 bases from the amplicon edge, so this is the biggest kmer you can get without inducing a dangling end. Indeed it does not, so `--kmer-size 14` solves the problem. The second solution is to use default downsampling settings, which of course greatly reduces your amplicon coverage. Purely by luck this ends up removing enough reads with the cycle-inducing error that assembly with 10-mers works.

The reason mitochondria mode works is that it is better at recovering dangling ends at high depth. Assembly with 10-mers still fails and 25-mers still put the variant on a dangling end, but in mitochondria mode we're able to grab the variant from the dangling end. At lower depths this creates a handful of weird false positives, so we can't recommend always turning on mitochondria mode. However, I think it's safe to recommend mitochondria mode as the best practice for amplicon sequencing.

### Re: HaplotypeCallerSpark throws error Unable to find class: htsjdk.samtools.reference.AbstractFastaSeque

Hi @sjbosco. This looks like a bug on our end, although I'm not sure why we don't see it in our tests. We test against Spark 2.4.3, so I wonder if there is an incompatibility there.

You shouldn't need to specify a separate Htsjdk jar since the correct one is included in the gatk jar we distribute.

I'll see if I can reproduce with a similar command line on my machine and get back to you.