PID FORMAT field in GenotypeGVCFs output


I'd be grateful for your help in understanding how GATK add phase information (PID field) in GenotypeGVCFs.

Here's an example of two lines from a VCF I'm getting for running GenotypeGVCFs on a set of samples:

chr1 8864083 . T C 462.27 . AC=4;AF=0.500;AN=8;BaseQRankSum=-4.950e-01;ClippingRankSum=0.406;DP=74;FS=0.000;MLEAC=4;MLEAF=0.500;MQ=60.00;MQRankSum=0.306;QD=6.42;ReadPosRankSum=1.54;SOR=0.674 GT:AD:DP:GQ:PL 0/1:22,12:34:99:205,0,518 0/1:10,4:14:60:60,0,220 0/1:3,3:6:71:71,0,72 0/1:10,8:18:99:158,0,238 ./.:2,0:2
chr1 8864426 . C T 5302.72 . AC=5;AF=0.500;AN=10;BaseQRankSum=0.815;ClippingRankSum=-6.270e-01;DP=316;FS=7.664;MLEAC=5;MLEAF=0.500;MQ=60.00;MQRankSum=-6.130e-01;QD=17.05;ReadPosRankSum=-1.240e-01;SOR=0.777 GT:AD:DP:GQ:PGT:PID:PL 0/1:24,66:90:99:0|1:8864426_C_T:1596,0,649 0/1:20,62:82:99:.:.:1502,0,536 0/1:24,37:61:99:0|1:8864426_C_T:921,0,647 0/1:22,41:63:99:0|1:8864426_C_T:1014,0,717 0/1:2,13:15:74:0|1:8864426_C_T:304,0,74

What's the interpretation for the phase information of the last sample in chr1:8864426? in position chr1:8864083 its genotype is not reported but in the proceeding position chr1:8864426 it is phased 0|1 WRT position chr1:8864083.

If I subsequently run SelectVariants for that sample, with --excludeNonVariants this phase information is retained but clearly looses context. So another question is if there is a way to change the PID field when running SelectVariants or VariantFiltration in case the position it is phased with is filtered?

Thanks a lot

Best Answer


  • Actually, to expand on this I found a similar situation in the VCF output of the GenotypeGVCFs step.

    chr1 346039 . T C 49.89 . AC=1;AF=0.167;AN=6;BaseQRankSum=0.727;ClippingRankSum=-7.270e-01;DP=10;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.727;QD=12.47;ReadPosRankSum=0.727;SOR=0.446 GT:AD:DP:GQ:PL 0/1:1,3:4:15:79,0,15 0/0:4,0:4:9:0,9,135 0/0:2,0:2:6:0,6,49 ./.:0,0:0
    chr1 346826 . G C 1158.94 . AC=4;AF=0.500;AN=8;BaseQRankSum=1.39;ClippingRankSum=0.264;DP=109;FS=0.721;MLEAC=4;MLEAF=0.500;MQ=60.00;MQRankSum=1.03;QD=10.63;ReadPosRankSum=0.076;SOR=0.823 GT:AD:DP:GQ:PGT:PID:PL 0/1:12,9:21:99:1|0:346816_G_A:249,0,308 0/1:17,15:32:99:.:.:303,0,385 0/1:19,9:28:99:.:.:225,0,577 0/1:14,14:28:99:.:.:414,0,326

    At position: chr1:346826 the first sample is phased WRT position 346816, as indicated by the PID field, but this position does not exists in this file. Is this phase information retained from the GVCF files?

    I also wonder about how is the encoding of the PID field decided on? If a position is phased shouldn't the PID field inform on the position it is phased WRT? What I see is that the PID of a first phased position in a sequence of phased positions has its ID. For example the PID of the second sample in the line blow is: 351879_C_G. Shouldn't it indicate the preceding position?

    chr1 351879 . C G 177.96 PASS AC=3;AF=0.375;AN=8;BaseQRankSum=-3.580e-01;ClippingRankSum=-1.231e+00;DP=17;FS=0.000;MLEAC=4;MLEAF=0.500;MQ=60.00;MQRankSum=-1.231e+00;QD=25.42;ReadPosRankSum=0.358;SOR=0.180 GT:AD:DP:GQ:PGT:PID:PL 0/0:3,0:3:0:.:.:0,0,48 0/1:2,3:5:99:0|1:351879_C_G:117,0,135 0/0:7,0:7:0:.:.:0,0,213 1/1:0,2:2:6:1|1:351879_C_G:90,6,0

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    This article really helped me understand phasing (have a look at "More detailed aspects of semantics of phasing in the VCF format"). Although the article is not specifically about HaplotypeCaller, it will help with understanding phasing.

    As for the last sample being phased at position chr1:8864426 with chr1 8864083, it is actually not phased at those positions. If you look at the variant sites after chr1:8864426, you should find another site with the same PID as chr1:8864426. That is the site chr1:8864426 is phased with.

    I hope this helps.


  • rubirubi Member

    Hi Sheila,,
    Thanks a lot for the response.

    The article says that RBP attempts to phase a heterozygous genotype relative the preceding HETEROZYGOUS genotype for that sample and that the preceding position should be NON-FILTERED.

    In my case the phase information is already present before any filtration. Does this means that VariantFiltration is supposed remove the phase information in sites for which their preceding sites are filtered? I haven't experienced that as I show in my second example.

    While sites after chr1:8864426 may also be phased with it, and therefore have the chr1:8864426 PID, according to the article since site chr1:8864426 has phase information it means that it is phased relative to site chr1:8864083 which precedes it. So it's just not clear to my how can site chr1:8864426 be phased chr1:8864083 when chr1:8864083 appears with this INFO: ./.:2,0:2

    Wouldn't it be most intuitive if the PID field of each phased site indicated exactly which site it is phased relative to?

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    HaplotypeCaller outputs phasing information in a slightly different way than ReadBackedPhasing. Because you have not done any filtering yet, none of the sites will have failed filtering yet. While ReadBackedPhasing does phase sites relative to previous sites, HaplotypeCaller uses the PID to identify the phased sites.

    In your case, the sites after chr1:8864426 that have PID of 8864426_C_T are the sites chr1:8864426 is phased with. Does this make sense?

    I will ask the team about having the PID field indicate exactly which site it is phased relative to.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    VariantFiltration and VQSR are all agnostic of phasing information. This is not optimal and may be addressed in future development. The note in the RBP doc is just meant to indicate that if a site is filtered, it will be ignored during phasing.

    We have no plans to change the representation of phasing at this time, sorry.

  • rubirubi Member

    Thanks a lot for the answer Sheila and Geraldine.

    Would it be correct to conclude that WRT the output of HaplotypeCaller one can only trust the phasing of positions with PIDs indicating a preceding position (e.g., positions following chr1:8864426 with PID: 8864426_C_T), yet cannot trust the phasing of positions in which the PID indicates its own site, such as position chr1:8864426 with its PID 8864426_C_T?

    Thanks a lot,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi rubi, it's not so much a question of trust as it is of how meaningful the information is.

    A philosopher could argue that to say of a site that it is phased with itself is the most true statement one can make about its phasing status. However an analyst would then be justified to tell the philosopher to go take a long walk off a very short cliff.

    That is to say, you can trust that information, but it is useless.

    In contrast, phasing information that tells you how a site's phase relates to one preceding it (or following it!) is inherently less reliable (because there are many ways that it could be incorrect) but it is a heck of a lot more useful.

    I mention explicitly the case where your variant of interest is the first in the chain because there is no directionality to the relationship; the first in the chain is just the first by an accident of convention (which way we read the sequence), and the fact that it is the "anchor" in the relationship, as opposed to the follower, should not influence how you estimate its trustworthiness.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi Rubi,

    You mentioned earlier that the phase information from the GVCF seems to be retained in the final VCF. For example, when one of the phased sites is not in the final VCF, the PID may still reflect the phasing. Can you submit a test case where this happens? Instructions are here.


Sign In or Register to comment.