# Best Of

### Re: Mutect2 output when calling variants for PON has genotype 0/1 for homozygous SNPs

I find myself revisiting this issue. I see no 1/1 genotype calls emitted by Mutect2 in two different projects of mine. From the above, I think it is true that to call a genotype of a tumor sample at a locus, one CANNOT use the GT values of the VCF file produced by Mutect2. For example, here are the GT:AD values of a VCF file entry for a sample:

0/1:0,20

In this case, the genotype might be said to be 1/1 or homozygous alternate.

However, this case is not described anywhere in your documentation that I can find. The closest I see to a description of Mutect2 GT field is at:

This shows several examples but none like the above.

I do see some good description in the article you mentioned above:

https://software.broadinstitute.org/gatk/documentation/article?id=11127

That article did not come up in my searches for Mutect2 lacking homozygous genotype calls.

After further consideration here, and reading the above, I'm now seeing where you are coming from. Since cancer genomes are so chaotic, with so much copy number variation and normal contamination, how do you really define genotype? If there are 10 copies of a locus, when do you call it heterozygous and when homozygous? Given this, I need to go back and see where I make use of the GT field, and question whether I should even be doing that.

On the other hand, people speak of LOH in cancer. How would that be defined when amplification is present? Does LOH mean that previously there were 1 or more copies of both wild type allele, and now there are 0 copies of it? How would one infer this from the Mutect2 VCF file? If AD were, say, 1,100, then probably one would assume the 1 WT read is likely an error, and the best call would be LOH. But it is a statistical thing, and Mutect2 should be dealing with those statistics, and putting something in the VCF to indicate that it has called the locus as LOH. Is there such a value output by Mutect2?

It seems to me that if you are going to use the GT key differently than the common conception of it, you should be more consistent with it, in the following way: either list GT numbers for alleles with non-zero read count ONLY, OR list GT numbers for ALL alleles and include read counts for ALL alleles even if 0. It makes no sense to say GT:AD=0/1:0,20 if the entire concept of assuming two copies of the allele are present has been discarded. Instead, use GT:AD=1:20, or, suppose ref=A, alt=C,T, then use for example GT:AD=0/1/2:0,20,0 where you show all alternate allele read counts. You mentioned something about constraints by other tools. My understanding now is that the VCF spec does not require the GT value to be genotype, although it doesn't explicitly say this (and the GT name doesn't help). If so, then I would think you should use the GT field in the most logical and useful way possible (which IMHO would be to list the read counts of ALL alleles including reference, even if 0). If other tools misinterpret it, they are probably misinterpreting it ANYWAY, regardless of how you use it.

And definitely you should include clear descriptions of how the GT and AD fields work in the documentation for Mutect2, and clearly indicate that the GT field should NOT be interpreted as genotype, and briefly state why.

### Re: (How to) Filter variants either with VQSR or by hard-filtering

Hi @shlee

I have whole genome sequence bam files from 96 animals. I am thinking of doing hard filtering as advised here https://software.broadinstitute.org/gatk/documentation/article.php?id=3225 in broad institute website.

- Extract the SNPs and Indels from the call set - SelectVariants tool
- Apply the filter to the SNP and Indel call set - VariantFiltration tool
- CombineVariants in to single vcf file - CombineVariants

Are these steps still valid?

I plan to perform bootstrapping method. I generated gvcf files for each bam file(96).

Now I am not sure whether to generated separate vcf files(96) for each bam file and do the bootstrapping separately and combine the final trained vcf files (96) into a single file to come up with the known truth set of variants. Or call CombineGVCFs on 96 gvcf files and generate a single gvcf file and a single vcf file out of it. Then use that single vcf file to bootstrap 96 bam files separately.

Any help is much appreciated.

Kalpi

### Re: LiftoverVCF chain file for b37 to hg38

@SChaluvadi Can we add this to the resource bundle?

### Re: gatk4 error: java.lang.IllegalStateException: The covariates table is missing ReadGroup

"

...thank you and GATK team, I've already solved my problem now."

Hi there, @xingaulag! For the benefit of others who may have encountered a similar issue, would you mind explaining how you got it to work?

### Assigning per-sample genotypes (HaplotypeCaller)

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as the one on PL and GQ.

This procedure is NOT applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.

#### Contents

- Overview
- Preliminary assumptions / limitations
- Calculating genotype likelihoods using Bayes' Theorem
- Selecting a genotype and emitting the call record

## 1. Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

Note that this describes the **regular mode** of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in `GVCF`

modes (`-ERC GVCF`

and `-ERC BP_RESOLUTION`

) please see the reference confidence model documentation.

## 2. Preliminary assumptions / limitations

### Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

### Ploidy

Both the HaplotypeCaller and GenotypeGVCFs assume that the organism of study is diploid by default, but the desired ploidy can be set using the `-ploidy`

argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation using a generalized form of the genotyping algorithm that can handle ploidies other than 2. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered. There are some arguments that aim to mitigate those limitations but they are not fully documented yet.

### Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

### Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.

## 3. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype **G** given the observed data **D**.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype **G**:

$$ P(G) $$

represents how probably we expect to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype **G** is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype **G** into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype **G**.

## 4. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.

### Re: Missing PS field in the VCF file produced by GenotypeGVCFs

I'm sorry @svitlana, but it looks like you asked your question at an inopportune time! We have just moved to our new site (accessible at gatk.broadinstitute.org) and have shut down posting here. That means means that you will not find the answer to your question, here!

In the meantime, I've duplicated your question to the new forum at this url, where you can add your own comments and follow the thread for updates. Please be sure to register for a new forum account (this new account will extend to all the software we support, including Terra and Cromwell/WDL). We apologize for any inconvenience!

You can find more information about the move on our blog post.

### Re: SAC annotation

Thank you for your input and contribution @johnma!

### Re: SAC annotation

Maybe it's a bit too late, but what I'd do is to:

1. Use HaplotypeCaller to do forced genotyping, using the `--alleles`

argument, and then

2. Use `bcftools annotate -a HC.vcf.gz -c FMT/SAC old.vcf`

to copy the SAC field back to the old callset.

### Re: Filter samples of bad quality before running GermlineCNVCaller

It may be better to check mean and median coverage of samples before generating a cohort. Greater the variability between samples the more false positives and negatives you get.

### Re: HaplotypeCallerSpark throws error Unable to find class: htsjdk.samtools.reference.AbstractFastaSeque

Thank you! Merry Christmas and New Year to you as well. I hope you get some time to relax without any spark bugs interrupting you!