It looks like you're new here. If you want to get involved, click one of these buttons!
Hi,
I am using GATK UnifiedGenotyper 2.3-9 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?
Example VCF line:
GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0/0/1/1/1/1/1/1:71,177:249:17:6:0.750:5685,995,510,254,101,17,0,92,32767
Thank you very much!
Best regards,
Allen
Say that you have a diploid organism, biallelic site (ploidy=2, want to find what's the linear index of say the conformation [2,0] (homozygous ref, since by convention the first index is always the ref allele). Then vectorIdx = [2,0]. As you noted, the outer for loop collapses and k is always 1 (so you only need to take a look at the second element of vectorIdx to figure out its index, which makes sense) the for l. Immediately, idx = vectorIdx[1] = 0 so linearIdx = 0 and we know then that the 0'th element of the PL's will hold that vectorIdx=[2,0].
Now say that vectorIdx = [1,1] (normal heterozygous genotype). In this case, idx = 1. The inner for loop collapses to only p=0 , so
linearIdx = Ne(1,2-1) = Ne(1,1) = 1.
So element 1 of the PL vector holds this allele conformation.
Similarly, for vectorIdx = [0,2] (homozygous genotype), idx = 2. When p=0, linearIdx = Ne(1,2-0) = Ne(1,2) = 1.
When p=1, linearIdx += Ne(1,2-1) = 1, so linearIdx = 2.
Hope this makes sense.
Answers
This is a bit complicated, but it's a generalization of the formula you described above. Let
Ploidy = P
,# of alleles (including reference) = N
.Let first
Ne(N,P) = # of possible allele conformations given N and P in a sample
, which is equivalent to the # of non-negative integer vectors of length N that add exactly to P. For example,Ne(2,2) = 3 (diploid biallelic case)
because you can either have[2,0]
,[1,1]
or[0,2]
as possible allele conformations in a site.Ne can be computed recursively as:
Ne(N,P) = sum_{k=0}^{k<=P} Ne(N-1,P-k)
with boundary conditions
Ne(1,P)=1. Ne(N,1) = N
Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudo-code can be used:
For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:
You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.
Hi delangel,
Thanks for your detailed reply!
I don't quite understand the definition of vectorIdx in your pseudo-code. Using your Ne(2,2) example, the pseudo-code can be collapsed to:
From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?
Best regards,
Allen
Say that you have a diploid organism, biallelic site (ploidy=2, want to find what's the linear index of say the conformation [2,0] (homozygous ref, since by convention the first index is always the ref allele). Then vectorIdx = [2,0]. As you noted, the outer for loop collapses and k is always 1 (so you only need to take a look at the second element of vectorIdx to figure out its index, which makes sense) the for l. Immediately, idx = vectorIdx[1] = 0 so linearIdx = 0 and we know then that the 0'th element of the PL's will hold that vectorIdx=[2,0].
Now say that vectorIdx = [1,1] (normal heterozygous genotype). In this case, idx = 1. The inner for loop collapses to only p=0 , so
linearIdx = Ne(1,2-1) = Ne(1,1) = 1.
So element 1 of the PL vector holds this allele conformation.
Similarly, for vectorIdx = [0,2] (homozygous genotype), idx = 2. When p=0, linearIdx = Ne(1,2-0) = Ne(1,2) = 1.
When p=1, linearIdx += Ne(1,2-1) = 1, so linearIdx = 2.
Hope this makes sense.
Thanks again for this walkthrough.
I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!