The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# GATK VCF PL field ordering for pooled/polyploid samples

Posts: 3Member

Hi,

I am using GATK UnifiedGenotyper 2.3-9 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?

Example VCF line:

Thank you very much!

Best regards,
Allen

Tagged:

• Posts: 71GATK Developer mod
edited February 2013

This is a bit complicated, but it's a generalization of the formula you described above. Let Ploidy = P, # of alleles (including reference) = N.

Let first Ne(N,P) = # of possible allele conformations given N and P in a sample, which is equivalent to the # of non-negative integer vectors of length N that add exactly to P. For example, Ne(2,2) = 3 (diploid biallelic case) because you can either have [2,0], [1,1] or [0,2] as possible allele conformations in a site.

Ne can be computed recursively as:

Ne(N,P) = sum_{k=0}^{k<=P} Ne(N-1,P-k)

with boundary conditions Ne(1,P)=1. Ne(N,1) = N

Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudo-code can be used:

  int linearIdx = 0;
int cumSum = ploidy;
for (int k=numAlleles-1;k>=1; k--) {
int idx = vectorIdx[k];
if (idx == 0)
continue;
for (int p=0; p < idx; p++)
linearIdx += Ne( k, cumSum-p);

cumSum -= idx;
}


For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:

[P,0]
[P-1,1]
...
[1,P-1]
[0,P]


You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.

Post edited by Geraldine_VdAuwera on
• Posts: 3Member

Hi delangel,

I don't quite understand the definition of vectorIdx in your pseudo-code. Using your Ne(2,2) example, the pseudo-code can be collapsed to:

int linearIdx = 0;
int cumSum = 2;

int idx = vectorIdx[1];
if (idx == 0)
continue;
for (int p=0; p < idx; p++)
linearIdx += Ne( k, cumSum-p);
cumSum -= idx;


From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?

Best regards,
Allen

• Posts: 3Member

Thanks again for this walkthrough.
I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!