We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

# GATK VCF PL field ordering for pooled/polyploid samples

Hi,

I am using GATK UnifiedGenotyper 2.3-9 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?

Example VCF line:

Thank you very much!

Best regards,
Allen

Tagged:

• This is a bit complicated, but it's a generalization of the formula you described above. Let `Ploidy = P`, `# of alleles (including reference) = N`.

Let first `Ne(N,P) = # of possible allele conformations given N and P in a sample`, which is equivalent to the # of non-negative integer vectors of length N that add exactly to P. For example, `Ne(2,2) = 3 (diploid biallelic case)` because you can either have `[2,0]`, `[1,1]` or `[0,2]` as possible allele conformations in a site.

Ne can be computed recursively as:

`Ne(N,P) = sum_{k=0}^{k<=P} Ne(N-1,P-k)`

with boundary conditions `Ne(1,P)=1. Ne(N,1) = N`

Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudo-code can be used:

```  int linearIdx = 0;
int cumSum = ploidy;
for (int k=numAlleles-1;k>=1; k--) {
int idx = vectorIdx[k];
if (idx == 0)
continue;
for (int p=0; p < idx; p++)
linearIdx += Ne( k, cumSum-p);

cumSum -= idx;
}
```

For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:

```[P,0]
[P-1,1]
...
[1,P-1]
[0,P]
```

You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.

• Hi delangel,

I don't quite understand the definition of vectorIdx in your pseudo-code. Using your Ne(2,2) example, the pseudo-code can be collapsed to:

```int linearIdx = 0;
int cumSum = 2;

int idx = vectorIdx;
if (idx == 0)
continue;
for (int p=0; p < idx; p++)
linearIdx += Ne( k, cumSum-p);
cumSum -= idx;
```

From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?

Best regards,
Allen

• Thanks again for this walkthrough.
I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!