This site is now readonly. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
GATK VCF PL field ordering for pooled/polyploid samples
Hi,
I am using GATK UnifiedGenotyper 2.39 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?
Example VCF line:
GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0/0/1/1/1/1/1/1:71,177:249:17:6:0.750:5685,995,510,254,101,17,0,92,32767
Thank you very much!
Best regards,
Allen
Best Answer

delangel Broad Institute ✭✭
Say that you have a diploid organism, biallelic site (ploidy=2, want to find what's the linear index of say the conformation [2,0] (homozygous ref, since by convention the first index is always the ref allele). Then vectorIdx = [2,0]. As you noted, the outer for loop collapses and k is always 1 (so you only need to take a look at the second element of vectorIdx to figure out its index, which makes sense) the for l. Immediately, idx = vectorIdx[1] = 0 so linearIdx = 0 and we know then that the 0'th element of the PL's will hold that vectorIdx=[2,0].
Now say that vectorIdx = [1,1] (normal heterozygous genotype). In this case, idx = 1. The inner for loop collapses to only p=0 , so
linearIdx = Ne(1,21) = Ne(1,1) = 1.
So element 1 of the PL vector holds this allele conformation.
Similarly, for vectorIdx = [0,2] (homozygous genotype), idx = 2. When p=0, linearIdx = Ne(1,20) = Ne(1,2) = 1.
When p=1, linearIdx += Ne(1,21) = 1, so linearIdx = 2.Hope this makes sense.
Answers
This is a bit complicated, but it's a generalization of the formula you described above. Let
Ploidy = P
,# of alleles (including reference) = N
.Let first
Ne(N,P) = # of possible allele conformations given N and P in a sample
, which is equivalent to the # of nonnegative integer vectors of length N that add exactly to P. For example,Ne(2,2) = 3 (diploid biallelic case)
because you can either have[2,0]
,[1,1]
or[0,2]
as possible allele conformations in a site.Ne can be computed recursively as:
Ne(N,P) = sum_{k=0}^{k<=P} Ne(N1,Pk)
with boundary conditions
Ne(1,P)=1. Ne(N,1) = N
Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudocode can be used:
For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:
You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.
Hi delangel,
Thanks for your detailed reply!
I don't quite understand the definition of vectorIdx in your pseudocode. Using your Ne(2,2) example, the pseudocode can be collapsed to:
From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?
Best regards,
Allen
Say that you have a diploid organism, biallelic site (ploidy=2, want to find what's the linear index of say the conformation [2,0] (homozygous ref, since by convention the first index is always the ref allele). Then vectorIdx = [2,0]. As you noted, the outer for loop collapses and k is always 1 (so you only need to take a look at the second element of vectorIdx to figure out its index, which makes sense) the for l. Immediately, idx = vectorIdx[1] = 0 so linearIdx = 0 and we know then that the 0'th element of the PL's will hold that vectorIdx=[2,0].
Now say that vectorIdx = [1,1] (normal heterozygous genotype). In this case, idx = 1. The inner for loop collapses to only p=0 , so
linearIdx = Ne(1,21) = Ne(1,1) = 1.
So element 1 of the PL vector holds this allele conformation.
Similarly, for vectorIdx = [0,2] (homozygous genotype), idx = 2. When p=0, linearIdx = Ne(1,20) = Ne(1,2) = 1.
When p=1, linearIdx += Ne(1,21) = 1, so linearIdx = 2.
Hope this makes sense.
Thanks again for this walkthrough.
I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!