The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.
Last chance to register for the GATK workshop next week in Basel, Switzerland! http://www.sib.swiss/training/upcoming-training-events/training/gatk-workshop-lecture

GATK VCF PL field ordering for pooled/polyploid samples

allenyuallenyu Posts: 3Member

Hi,

I am using GATK UnifiedGenotyper 2.3-9 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?

Example VCF line:
GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0/0/1/1/1/1/1/1:71,177:249:17:6:0.750:5685,995,510,254,101,17,0,92,32767

Thank you very much!

Best regards,
Allen

Tagged:

Best Answer

Answers

  • delangeldelangel Posts: 71Dev mod
    edited February 2013

    This is a bit complicated, but it's a generalization of the formula you described above. Let Ploidy = P, # of alleles (including reference) = N.

    Let first Ne(N,P) = # of possible allele conformations given N and P in a sample, which is equivalent to the # of non-negative integer vectors of length N that add exactly to P. For example, Ne(2,2) = 3 (diploid biallelic case) because you can either have [2,0], [1,1] or [0,2] as possible allele conformations in a site.

    Ne can be computed recursively as:

    Ne(N,P) = sum_{k=0}^{k<=P} Ne(N-1,P-k)

    with boundary conditions Ne(1,P)=1. Ne(N,1) = N

    Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudo-code can be used:

      int linearIdx = 0;
        int cumSum = ploidy;
        for (int k=numAlleles-1;k>=1; k--) {
            int idx = vectorIdx[k];
            if (idx == 0)
                continue;
            for (int p=0; p < idx; p++)
                linearIdx += Ne( k, cumSum-p);
    
            cumSum -= idx;
        }
    

    For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:

    [P,0]
    [P-1,1]
    ...
    [1,P-1]
    [0,P]
    

    You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.

  • allenyuallenyu Posts: 3Member

    Hi delangel,

    Thanks for your detailed reply!
    I don't quite understand the definition of vectorIdx in your pseudo-code. Using your Ne(2,2) example, the pseudo-code can be collapsed to:

    int linearIdx = 0;
    int cumSum = 2;
    
    int idx = vectorIdx[1];
    if (idx == 0)
        continue;
    for (int p=0; p < idx; p++)
        linearIdx += Ne( k, cumSum-p);
    cumSum -= idx;
    

    From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?

    Best regards,
    Allen

  • allenyuallenyu Posts: 3Member

    Thanks again for this walkthrough.
    I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
    I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!

Sign In or Register to comment.