Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

GATK VCF PL field ordering for pooled/polyploid samples

allenyuallenyu Posts: 3Member

Hi,

I am using GATK UnifiedGenotyper 2.3-9 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?

Example VCF line:
GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0/0/1/1/1/1/1/1:71,177:249:17:6:0.750:5685,995,510,254,101,17,0,92,32767

Thank you very much!

Best regards,
Allen

Tagged:

Best Answer

Answers

  • delangeldelangel Posts: 71GATK Developer mod
    edited February 2013

    This is a bit complicated, but it's a generalization of the formula you described above. Let Ploidy = P, # of alleles (including reference) = N.

    Let first Ne(N,P) = # of possible allele conformations given N and P in a sample, which is equivalent to the # of non-negative integer vectors of length N that add exactly to P. For example, Ne(2,2) = 3 (diploid biallelic case) because you can either have [2,0], [1,1] or [0,2] as possible allele conformations in a site.

    Ne can be computed recursively as:

    Ne(N,P) = sum_{k=0}^{k<=P} Ne(N-1,P-k)

    with boundary conditions Ne(1,P)=1. Ne(N,1) = N

    Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudo-code can be used:

      int linearIdx = 0;
    int cumSum = ploidy;
    for (int k=numAlleles-1;k>=1; k--) {
    int idx = vectorIdx[k];
    if (idx == 0)
    continue;
    for (int p=0; p < idx; p++)
    linearIdx += Ne( k, cumSum-p);

    cumSum -= idx;
    }

    For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:

    [P,0]
    [P-1,1]
    ...
    [1,P-1]
    [0,P]

    You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.

    Post edited by Geraldine_VdAuwera on
  • allenyuallenyu Posts: 3Member

    Hi delangel,

    Thanks for your detailed reply!
    I don't quite understand the definition of vectorIdx in your pseudo-code. Using your Ne(2,2) example, the pseudo-code can be collapsed to:

    int linearIdx = 0;
    int cumSum = 2;

    int idx = vectorIdx[1];
    if (idx == 0)
    continue;
    for (int p=0; p < idx; p++)
    linearIdx += Ne( k, cumSum-p);
    cumSum -= idx;

    From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?

    Best regards,
    Allen

  • allenyuallenyu Posts: 3Member

    Thanks again for this walkthrough.
    I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
    I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!

Sign In or Register to comment.