**Notice:**

If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

#### Test-drive the GATK tools and Best Practices pipelines on Terra

**Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.**

# Statistical methods used by GATK tools

**This document is out of date; see individual method documents in the Methods and Algorithms section.**

### List of documented methods below

- Inbreeding Coefficient
- Rank Sum Test

## Inbreeding Coefficient

### Overview

Although the name Inbreeding Coefficient suggests it is a measure of inbreeding, Inbreeding Coefficient measures the excess heterozygosity at a variant site. It can be used as a proxy for poor mapping (sites that have high Inbreeding Coefficients are typically locations in the genome where the mapping is bad and reads that are in the region mismatch the region because they belong elsewhere). At least 10 samples are required (preferably many more) in order for this annotation to be calculated properly.

### Theory

The Wikipedia article about Hardy-Weinberg principle includes some very helpful information on the theoretical underpinnings of the test, as Inbreeding Coefficient relies on the math behind the Hardy-Weinberg Principle.

### Use in GATK

We calculate Inbreeding Coefficient as 1-(# observed heterozygotes)/(# expected heterozygotes). The number of observed heterozygotes can be calculated from the data. The number of expected heterozygotes is 2pq, where p is the frequency of the reference allele and q is the frequency of the alternate allele (AF). (Please see Hardy-Weinberg Principle link above). A value of 0 suggests the site is in Hardy-Weinberg Equilibrium. Negative values of Inbreeding Coefficient could mean there are too many heterozygotes and suggest a site with bad mapping. The other nice side effect is that one of the error modes in variant calling is for all calls to be heterozygous, which this metric captures nicely. This is why we recommend filtering out variants with negative Inbreeding Coefficients. Although positive values suggest too few heterozygotes, we do not recommend filtering out positive values because they could arise from admixture of different ethnic populations.

#### Please note: Inbreeding Coefficient is not really robust to the assumption of being unrelated. We have found that relatedness does break down the assumptions Inbreeding Coefficient is based on. For family samples, it really depends on how many families and samples you have. For example, if you have 3 families, inbreeding coefficient is not going to work. But, if you have 10,000 samples and just a few families, it should be fine. Also, if you pass in a pedigree file (*.ped), it will use that information to calculate Inbreeding Coefficient only using the founders (i.e. individuals whose parents aren't in the callset), and as long as there are >= 10 of those, the data should be pretty good.

### Example: Inbreeding Coefficient

In this example, lets say we are working with 100 human samples, and we are trying to calculate Inbreeding Coefficient at a site that has A for the reference allele and T for the alternate allele.

#### Step 1: Count the number of samples that have each genotype (hom-ref, het, hom-var)

A/A (hom-ref): 51

A/T (het): 11

T/T (hom-var): 38

#### Step 2: Get all necessary information to solve equation

We need to find the # observed hets and # expected hets.

number of observed hets = 11 (from number of observed A/T given above)

number of expected hets = 2pq * total genotypes (2pq is frequency of heterozygotes according to Hardy-Weinberg Equilibrium. We need to multiply that frequency by the number of all genotypes in the population to get the expected number of heterozygotes.)

p = frequency of ref allele = (# ref alleles)/(total # alleles) = (2 * 51 + 11)/(2 * 51 + 11 * 2 + 38 * 2) = 113/200 = 0.565

q = frequency of alt allele = (# alt alleles)/(total # alleles) = (2 * 38 + 11)/(2 * 51 + 11 * 2 + 38 * 2) = 87/200 = 0.435

#### Remember that homozygous genotypes have two copies of the allele of interest (because we're assuming diploid.)

number of expected hets = 2pq * 100 = 2 * 0.565 * 0.435 * 100 = 49.155

#### Step 3: Plug in the Numbers

Inbreeding Coefficient = 1 - (# observed hets)/(#expected hets) = 1 - (11/49.155) = 0.776

#### Step 4: Interpret the output

Our Inbreeding Coefficient is 0.776. Because it is a positive number, we can see there are fewer than the expected number of heterozygotes according to the Hardy-Weinberg Principle. Too few heterozygotes can imply inbreeding. However, we do not recommend filtering this site out because there may be a mixture of ethnicities in the cohort, and some ethnicities may be hom-ref while others are hom-var.

## Rank Sum Test

### Overview

The Rank Sum Test, also known as Mann-Whitney-Wilcoxon U-test after its developers (who are variously credited in subsets and in different orders depending on the sources you read) is a statistical test that aims to determine whether there is significant difference in the values of two populations of data.

### Theory

The Wikipedia article about the Rank Sum Test includes some very helpful information on the theoretical underpinnings of the test, as well as various examples of how it can be applied.

### Use in GATK

This test is used by several GATK annotations, including two standard annotations that are used for variant recalibration in the Best Practices: MappingQualityRankSum and ReadPosRankSum. In all cases, the idea is to check, for a given candidate variant, whether the properties of the data that support the reference allele are similar to those of the data that support a variant allele. If they are not similar, we conclude that there may be some technical bias and that the candidate variant may be an artifact.

### Example: BaseQualityRankSumTest

*Note: this example applies Method 2 from the Wikipedia article linked above.*

In this example, we have a set of 20 reads, 10 of which support the reference allele and 10 of which support the alternate allele. At first glance, that looks like a clear heterozygous 0/1 site. But to be thorough in our analysis and to account for any technical bias, we want to determine if there is a significant difference in the base qualities of the bases that support the reference allele vs. the bases that support the alternate allele.

Before we proceed, we must define our null hypothesis and alternate hypothesis.

-*Null hypothesis:* There is **no** difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

-*Alternate hypothesis:* There **is** a difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.

#### Step 1: List the relevant observations

Reference allele base qualities: 20, 25, 26, 30, 32, 40, 47, 50, 53, 60

Alternate allele base qualities: 0, 7, 10, 17, 20, 21, 30, 34, 40, 45

#### Step 2: Rank the observations

First, we arrange all the observations (base qualities) into a list of values ordered from lowest to highest (reference bases are in bold).

0, 7, 10, 17, **20**, 20, 21, **25**, **26**, **30**, 30, **32**, 34, **40**, 40, 45, **47**, **50**, **53**, **60**

Next we determine the ranks of the values. Since there are 20 observations (the base qualities), we have 20 ranks to assign. Whenever there are ties between observations for the rank, we take the rank to be equal to the midpoint of the ranks. For example, for 20(ref) and 20(alt), we have a tie in values, so we assign each observation a rank of (5+6)/2 = 5.5.

The ranks from the above list are (reference ranks are in bold):

1, 2, 3, 4, **5.5**, 5.5, 7, **8**, **9**, **10.5**, 10.5, **12**, 13, **14.5**, 14.5, 16, **17**, **18**, **19**, **20**

#### Step 3: Add up the ranks for each group

We now need to add up the ranks for the base qualities that came from the reference allele and the alternate allele.

$$ Rank_{ref} = 133.5 $$

$$ Rank_{alt} = 76.5 $$

#### Step 4: Calculate U for each group

U is a statistic that tells us the difference between the two rank totals. We can use the U statistic to calculate the z-score (explained below), which will give us our p-value.

Calculate U for each group (n = number of observations in each sample)

$$ U_{ref} = \frac{ n_{ref} * n_{alt} + n_{ref} * (n_{ref}+ 1) }{ 2 } - Rank_{ref} $$

$$ U_{alt} = \frac{ n_{alt} * n_{ref} + n_{alt} * (n_{alt} + 1) }{ 2 } - Rank_{alt} $$

$$ U_{ref} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 133.5 = 21.5 $$

$$ U_{alt} = \frac{ 10 * 10 + 10 * 11 }{ 2 } - 76.5 = 78.5 $$

#### Step 5: Calculate the overall z-score

Next, we need to calculate the z-score which will allow us to get the p-value. The z-score is a normalized score that allows us to compare the probability of the U score occurring in our distribution.

https://statistics.laerd.com/statistical-guides/standard-score.php

The equation to get the z-score is:

$$ z = \frac{U - mu}{u} $$

Breaking this equation down:

$$ z = z-score $$

$$ U = \text{lowest of the U scores calculated in previous steps} $$

$$ mu = \text{mean of the U scores above} = \frac{ n_{ref} * n_{alt} }{ 2 } $$

$$ u = \text{standard deviation of U} = \sqrt{ \frac{n_{ref} * n_{alt} * (n_{ref} + n_{alt} + 1) }{ 12 } } $$

To calculate our z:

$$ U = 21.5 $$

$$ mu = \frac{10 * 10 }{ 2 } = 50 $$

$$ u = \sqrt{ \frac{10 * 10 *(10 + 10 + 1) }{ 12 } } = 13.229 $$

So altogether we have:

$$ z = \frac{ 21.5 - 50 }{ 13.229 } = -2.154 $$

#### Step 6: Calculate and interpret the p-value

The p-value is the probability of obtaining a z-score at least as extreme as the one we got, assuming the null hypothesis is true. In our example, the p-value gives us the probability that there is no difference in the base qualities that support the reference allele and the base qualities that support the alternate allele. The lower the p-value, the less likely it is that there is no difference in the base qualities.

Going to the z-score table, or just using a p-value calculator, we find the p-value to be 0.0312.

This means there is a .0312 chance that the base quality scores of the reference allele and alternate allele are the same. Assuming a p-value cutoff of 0.05, meaning there is less than 5% chance there is no difference in the two groups, and greater than or equal to 95% chance that there is a difference between the two groups, we have enough evidence to **reject our null hypothesis** that there is no difference in the base qualities of the reference and alternate allele. This indicates there is some bias and that the alternate allele is less well supported by the data than the allele counts suggest.

## Comments

The explanation of the Rank Sum Tests here is very useful, but I don't think it explains how you combine information across samples. I am finding that many of my variants have missing values in the RankSum annotations. Having looked at my data and trawled the Web (e.g. http://seqanswers.com/forums/showthread.php?t=16556), I have come to realise that these annotations are only calculated from heterozygous samples - is that correct? I am assuming I am getting missing values because my samples are haploid (or in some cases mixtures of haploid genomes in unknown proportions). Is there any reason you couldn't calculate these RankSum metrics across all samples, not just heterozygous samples? Doing so would then presumably mean that I would not have missing values in my haploid samples.

As a follow on question, how are missing values treated by VQSR? Also, what is your recommendation for whether variables with large proportions of missing values should be included in VQSR training? I have recently made the decision to exclude MQRankSum and ReadPosRankSum from training for SNPs, and BaseQRankSum from training for indels, despite these being suggested in best practices.

Please ignore the bit about BaseQRankSum being suggested for training for indels in best practices - I'm not sure where I picked this up from but just checked and it seems MQRankSum and ReadPosRankSum are recommended for both SNPs and indels, but BaseQRankSum is recommended for neither.

Yes, ranksum tests are only done on het samples, as indicated in the individual annotation documentation iirc. This is because the test relies on seeing a mix of reads bearing the ref and the alt alleles. In homozygous samples (or haploids) you don't expect to see that, so it's just not an appropriate test.

Unfortunately many of our annotations have diploid-specific assumptions so their usability for non-diploids is limited.