# Inbreeding Coefficient

## Overview

Although its name suggests a measure of inbreeding, Inbreeding Coefficient as used here measures the excess heterozygosity at a variant site. It can be used as a proxy for poor mapping: sites with strongly negative Inbreeding Coefficients (too many heterozygotes) are typically locations in the genome where the mapping is bad and the reads in the region mismatch it because they belong elsewhere. At least 10 samples are required (preferably many more) for this annotation to be calculated properly.

### Theory

The Wikipedia article on the Hardy-Weinberg principle includes helpful information on the theoretical underpinnings of the test, as Inbreeding Coefficient relies on the math behind that principle.

### Use in GATK

We calculate Inbreeding Coefficient as

$$ 1-\frac{ \text{# observed heterozygotes} }{ \text{# expected heterozygotes} } $$

The number of observed heterozygotes can be counted directly from the data. The expected frequency of heterozygotes is `2pq`, where `p` is the frequency of the reference allele and `q` is the frequency of the alternate allele (AF). Multiplying that frequency by the total number of genotypes gives the expected number of heterozygotes. (See the Hardy-Weinberg principle article mentioned above.)
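Written out in terms of genotype counts, the formula can be expanded as follows. The symbols $n_{RR}$, $n_{RA}$, $n_{AA}$ (hom-ref, het, and hom-var counts) and $N$ are notation introduced here for illustration, and the expansion assumes diploid genotypes at a biallelic site.

$$ N = n_{RR} + n_{RA} + n_{AA} $$

$$ p = \frac{ 2 n_{RR} + n_{RA} }{ 2N } \qquad q = \frac{ 2 n_{AA} + n_{RA} }{ 2N } = 1 - p $$

$$ \text{Inbreeding Coefficient} = 1 - \frac{ n_{RA} }{ 2pqN } $$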

A value of 0 suggests the site is in Hardy-Weinberg Equilibrium. Negative values of Inbreeding Coefficient mean there are more heterozygotes than expected, which suggests a site with bad mapping. A useful side effect is that one of the error modes in variant calling is for all calls to be heterozygous, and this metric captures that case nicely. This is why we recommend filtering out variants with negative Inbreeding Coefficients. Although positive values suggest too few heterozygotes, we do not recommend filtering out positive values because they could arise from admixture of different ethnic populations.
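To see why the all-heterozygous error mode shows up so clearly, consider the limiting case (worked out here purely for illustration) in which all $N$ samples at a site are called heterozygous. Both alleles then have frequency 0.5, so

$$ \text{number of expected hets} = 2 * 0.5 * 0.5 * N = 0.5N $$

$$ \text{Inbreeding Coefficient} = 1 - \frac{ N }{ 0.5N } = -1 $$

which is the lowest value this formula can produce.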

#### Important note:

Inbreeding Coefficient assumes the samples are unrelated, and it is not robust to violations of that assumption. We have found that relatedness does break down the assumptions Inbreeding Coefficient is based on. For family samples, it depends on how many families and samples you have. For example, if you have only 3 families, Inbreeding Coefficient is not going to work. But if you have 10,000 samples and just a few families, it should be fine. Also, if you pass in a pedigree file (*.ped), the tool will use that information to calculate Inbreeding Coefficient using only the founders (i.e. individuals whose parents aren't in the callset), and as long as there are >= 10 of those, the data should be pretty good.
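As a rough illustration of the founder rule described above, the sketch below picks out founders from pedigree rows. It is not GATK's implementation: the function name `founders_in_callset` and the sample IDs are hypothetical, the rows are assumed to follow the usual PED column order (family ID, individual ID, father ID, mother ID, sex, phenotype), and a founder is taken to mean a callset sample with neither parent in the callset.

```python
def founders_in_callset(ped_rows, callset_samples):
    """Return callset samples whose father and mother are both absent from the callset."""
    callset = set(callset_samples)
    founders = []
    for row in ped_rows:
        # Only the first four PED columns matter here.
        _family_id, sample_id, father_id, mother_id = row[:4]
        if sample_id not in callset:
            continue
        if father_id not in callset and mother_id not in callset:
            founders.append(sample_id)
    return founders


# Hypothetical trio plus one unrelated sample; "0" marks an unknown parent.
ped_rows = [
    ("FAM1", "dad", "0", "0", "1", "0"),
    ("FAM1", "mom", "0", "0", "2", "0"),
    ("FAM1", "kid", "dad", "mom", "1", "0"),
    ("FAM2", "solo", "0", "0", "2", "0"),
]
founders = founders_in_callset(ped_rows, ["dad", "mom", "kid", "solo"])
print(founders)  # ['dad', 'mom', 'solo']

# The text above suggests at least 10 founders for a usable value.
enough_founders = len(founders) >= 10
```

In this toy pedigree only three founders remain, well short of the 10 suggested above.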

## Example: Inbreeding Coefficient

In this example, let's say we are working with 100 human samples, and we are trying to calculate Inbreeding Coefficient at a site that has A for the reference allele and T for the alternate allele.

### Step 1: Count the number of samples that have each genotype

HOM-REF A/A : 51

HET A/T : 11

HOM-VAR T/T : 38
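The tally above could be produced with a few lines of Python. This is a minimal sketch that assumes each genotype arrives as a string such as `A/T`; the function name `count_genotypes` and the input list are made up for illustration.

```python
def count_genotypes(genotypes, ref="A", alt="T"):
    """Tally diploid, biallelic genotype strings into HOM-REF / HET / HOM-VAR."""
    labels = ("HOM-REF", "HET", "HOM-VAR")  # indexed by number of alt alleles
    counts = {label: 0 for label in labels}
    for genotype in genotypes:
        alleles = genotype.split("/")
        n_alt = alleles.count(alt)
        if len(alleles) != 2 or alleles.count(ref) + n_alt != 2:
            raise ValueError(f"unexpected genotype: {genotype!r}")
        counts[labels[n_alt]] += 1
    return counts


# Illustrative input matching the counts in this example.
genotypes = ["A/A"] * 51 + ["A/T"] * 11 + ["T/T"] * 38
counts = count_genotypes(genotypes)
print(counts)  # {'HOM-REF': 51, 'HET': 11, 'HOM-VAR': 38}
```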

### Step 2: Get all necessary information to solve equation

We need to find the # observed hets and # expected hets:

$$ \text{number of observed hets} = 11 $$

from the number of observed A/T given above, and

$$ \text{number of expected hets} = 2pq * \text{total genotypes} $$

where `2pq` is the frequency of heterozygotes according to Hardy-Weinberg Equilibrium.

We need to multiply that frequency by the number of all genotypes in the population to get the expected number of heterozygotes.

So let's calculate `p`:

$$ p = \text{frequency of ref allele} = \frac{ \text{# ref alleles} }{ \text{total # alleles} } $$

$$ p = \frac{ 2 * 51 + 11 }{ 2 * 51 + 11 * 2 + 38 * 2} $$

$$ p = \frac{ 113 }{ 200 } = 0.565 $$

And now let's calculate `q`:

$$ q = \text{frequency of alt allele} = \frac{ \text{# alt alleles} }{ \text{total # alleles} } $$

$$ q = \frac{ 2 * 38 + 11 }{ 2 * 51 + 11 * 2 + 38 * 2 } $$

$$ q = 87/200 = 0.435 $$

Remember that homozygous genotypes have two copies of the allele of interest (because we're assuming a diploid organism).

$$ \text{number of expected hets} = 2pq * 100 $$

$$ = 2 * 0.565 * 0.435 * 100 = 49.155 $$

### Step 3: Plug in the Numbers

$$ \text{Inbreeding Coefficient} = 1 - \frac{ \text{# observed hets} }{ \text{#expected hets} } $$

$$ \text{IC} = 1 - \frac{ 11 }{49.155} = 0.776 $$
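Steps 1 to 3 can be reproduced with a short script. This is a sketch of the formula exactly as written in this document, working from hard genotype counts at a diploid, biallelic site; it is not meant to mirror how GATK computes the annotation internally. The function name and the choice to return `None` for a site with no expected heterozygotes are assumptions made here.

```python
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_var):
    """Return 1 - observed hets / expected hets, or None if no hets are expected."""
    n_samples = n_hom_ref + n_het + n_hom_var
    if n_samples == 0:
        raise ValueError("at least one genotype is required")
    total_alleles = 2 * n_samples                   # diploid: two alleles per sample
    p = (2 * n_hom_ref + n_het) / total_alleles     # reference allele frequency
    q = (2 * n_hom_var + n_het) / total_alleles     # alternate allele frequency
    expected_hets = 2 * p * q * n_samples
    if expected_hets == 0:
        return None                                 # only one allele observed at the site
    return 1 - n_het / expected_hets


ic = inbreeding_coefficient(n_hom_ref=51, n_het=11, n_hom_var=38)
print(round(ic, 3))  # 0.776
```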

### Step 4: Interpret the output

Our Inbreeding Coefficient is 0.776. Because it is a positive number, we can see there are fewer than the expected number of heterozygotes according to the Hardy-Weinberg Principle. Too few heterozygotes can imply inbreeding. Depending on the cohort we are working with, this could be a sign of false positives.