The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.

GATK VCF PL field ordering for pooled/polyploid samples

allenyuallenyu Member Posts: 3


I am using GATK UnifiedGenotyper 2.3-9 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?

Example VCF line:
GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0/0/1/1/1/1/1/1:71,177:249:17:6:0.750:5685,995,510,254,101,17,0,92,32767

Thank you very much!

Best regards,


Best Answer


  • delangeldelangel Dev Posts: 71
    edited February 2013

    This is a bit complicated, but it's a generalization of the formula you described above. Let Ploidy = P, # of alleles (including reference) = N.

    Let first Ne(N,P) = # of possible allele conformations given N and P in a sample, which is equivalent to the # of non-negative integer vectors of length N that add exactly to P. For example, Ne(2,2) = 3 (diploid biallelic case) because you can either have [2,0], [1,1] or [0,2] as possible allele conformations in a site.

    Ne can be computed recursively as:

    Ne(N,P) = sum_{k=0}^{k<=P} Ne(N-1,P-k)

    with boundary conditions Ne(1,P)=1. Ne(N,1) = N

    Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudo-code can be used:

      int linearIdx = 0;
        int cumSum = ploidy;
        for (int k=numAlleles-1;k>=1; k--) {
            int idx = vectorIdx[k];
            if (idx == 0)
            for (int p=0; p < idx; p++)
                linearIdx += Ne( k, cumSum-p);
            cumSum -= idx;

    For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:


    You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.

  • allenyuallenyu Member Posts: 3

    Hi delangel,

    Thanks for your detailed reply!
    I don't quite understand the definition of vectorIdx in your pseudo-code. Using your Ne(2,2) example, the pseudo-code can be collapsed to:

    int linearIdx = 0;
    int cumSum = 2;
    int idx = vectorIdx[1];
    if (idx == 0)
    for (int p=0; p < idx; p++)
        linearIdx += Ne( k, cumSum-p);
    cumSum -= idx;

    From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?

    Best regards,

  • allenyuallenyu Member Posts: 3

    Thanks again for this walkthrough.
    I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
    I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!

Sign In or Register to comment.