The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Did you remember to?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

GATK VCF PL field ordering for pooled/polyploid samples

allenyuallenyu Member Posts: 3

Hi,

I am using GATK UnifiedGenotyper 2.3-9 for a set of pooled samples. I am uncertain about the order of PL values in polyploid samples, as it wasn't defined in VCF v4.1 specifications. The ordering formula described in VCF v4.1: F(j/k) = (k*(k+1)/2)+j only applies to diploid case. May I know how GATK extended the ordering formula to handle polypoid samples?

Example VCF line:
GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0/0/1/1/1/1/1/1:71,177:249:17:6:0.750:5685,995,510,254,101,17,0,92,32767

Thank you very much!

Best regards,
Allen

Tagged:

Best Answer

Answers

  • delangeldelangel Broad InstituteMember Posts: 71
    edited February 2013

    This is a bit complicated, but it's a generalization of the formula you described above. Let Ploidy = P, # of alleles (including reference) = N.

    Let first Ne(N,P) = # of possible allele conformations given N and P in a sample, which is equivalent to the # of non-negative integer vectors of length N that add exactly to P. For example, Ne(2,2) = 3 (diploid biallelic case) because you can either have [2,0], [1,1] or [0,2] as possible allele conformations in a site.

    Ne can be computed recursively as:

    Ne(N,P) = sum_{k=0}^{k<=P} Ne(N-1,P-k)

    with boundary conditions Ne(1,P)=1. Ne(N,1) = N

    Then, to compute the index in the PL vector for a particular allele conformation stored in a vector called vectorIdx, the following pseudo-code can be used:

      int linearIdx = 0;
        int cumSum = ploidy;
        for (int k=numAlleles-1;k>=1; k--) {
            int idx = vectorIdx[k];
            if (idx == 0)
                continue;
            for (int p=0; p < idx; p++)
                linearIdx += Ne( k, cumSum-p);
    
            cumSum -= idx;
        }
    

    For the simple case where you only have 2 alleles (ref and alt), this collapses to the following ordering in the PL vector:

    [P,0]
    [P-1,1]
    ...
    [1,P-1]
    [0,P]
    

    You can see this intuitively in your genotype that you pasted. GT = 0/0/1/1/1/1/1/1 means that the most likely genotype in a pool has 2 ref chromosomes and 6 alt chromosomes. In the PL this conformation has the value =0 which is the most likely one.

  • allenyuallenyu Member Posts: 3

    Hi delangel,

    Thanks for your detailed reply!
    I don't quite understand the definition of vectorIdx in your pseudo-code. Using your Ne(2,2) example, the pseudo-code can be collapsed to:

    int linearIdx = 0;
    int cumSum = 2;
    
    int idx = vectorIdx[1];
    if (idx == 0)
        continue;
    for (int p=0; p < idx; p++)
        linearIdx += Ne( k, cumSum-p);
    cumSum -= idx;
    

    From your example, [2,0], [1,1] and [0,2] are the possible allele conformations stored in vectorIdx. However k can only be 1 when numAlleles is 2, thus corresponding to [1,1]. Could you explain a bit more on how to derive idx from vectorIdx?

    Best regards,
    Allen

  • allenyuallenyu Member Posts: 3

    Thanks again for this walkthrough.
    I thought vectorIdx stores all possible allele conformations, and instead it only stores the target allele conformation.
    I think this generalized ordering formula should be incorporated into future versions of VCF standard. Kudos for the GATK team!

Sign In or Register to comment.