Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

How can a homozygous reference 0/0 genotype (GT) have a heterozygous phased genotype (PGT) of 0|1?

Dear GATK support team,

Multi-sample variant calling files made with GATK4 have both genotype, genotype phase and genotype phase block information.

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">

If I run the following bcftools query I can get the 3 above fields for each sample.

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT\t%PGT\t%PID]\n' my_samples.vcf.gz

Combinations of GT, PGT and PID make sense to make. Heterozygous genotypes (0/1) can be both on the paternal (e.g. 0|1) or maternal haplotype (e.g. 1|0).

0/1     0|1     748_A_C
0/1     1|0     627_C_T

But how can a homozygous genotype (0/0) be phased on one of the parental haplotypes (e.g. 0|1)?
There is no alternative allele (=1) in the homozygous genotype (0/0)?

0/0 0|1 627_C_T

The above 0/0 genotype combined with 0|1 genotype phase does not make sense to me. Do you know what this means? How to interpret this. To me it just seems nonsens, the 0|1 627_C_T information can be discarded?

Thank you.

Answers

  • WimSWimS Member ✭✭

    @bhanuGandham Could you help with (having) this figured out? Thank you.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @WimS

    I will get back to you shortly.

  • WimSWimS Member ✭✭

    Hi @bhanuGandham do you already have an update/answer? Thank you.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @WimS

    1. Can you please post a few records from the vcf of the above mentioned inconsistency ?
    2. How many sites are you seeing this issue at ?
  • WimSWimS Member ✭✭
    edited October 10

    hi @bhanuGandham

    In a recent variant calling with 9 samples were we found c.a. 8M variants I have the following numbers:

    Number of variants with at least 1 phased genotype (i.e. PGT is in FORMAT)
    3.836.189

    Number of variants with at least 1 genotype were the allele set is different in the GT and PGT fields:
    287.738

    Number of genotypes were the allele set is different in the GT and PGT fields (i.e. more then 1 GT to PGT allele set mismatch can be present per variant):
    385.911

    So this is not just about a hand full of variants and genotypes.
    But in c.a 7% of all the phased variants in this variant calling.

    Either I am misunderstanding the PGT field or something going wrong frequently in creation or processing of the VCF records.
    GATK Version="4.1.1.0 via bcbio 1.1.5 was used to do the mapping and variant calling of the data.

    There are examples with GT = 0|0 and PGT = 0|1

    Chr_11  286   .       ACT     A       119.43  GATKCutoffIndel AC=2;AF=0.111;AN=18;BaseQRankSum=0.439;ClippingRankSum=-0.387;DP=126;ExcessHet=3.2736;FS=11.738;MLEAC=2;MLEAF=0.111;MQ=57.41;MQ0=0;MQRankSum=-3.615;QD=4.12;ReadPosRankSum=-1.304;SOR=3.626     GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:18,0:18:51:.:.:0,51,754:.   0/0:12,0:12:30:.:.:0,30,450:.   0|0:15,1:16:3:0|1:286_ACT_A:0,3,610:286     0/0:9,0:9:24:.:.:0,24,274:.     0|1:8,3:11:99:0|1:286_ACT_A:99,0,369:286    0/0:11,0:11:30:.:.:0,30,450:.   0/0:14,0:14:39:.:.:0,39,585:.   0/0:15,0:15:45:.:.:0,45,510:.   0|1:16,2:18:36:0|1:286_ACT_A:36,0,666:286
    
    0|0:15,1:16:3:0|1:286_ACT_A:0,3,610:286
    
    Chr_11  5187   .    T       C       199.48  PASS    AC=1;AF=0.056;AN=18;BaseQRankSum=-0.974;ClippingRankSum=5;DP=134;ExcessHet=3.2451;FS=1.674;MLEAC=1;MLEAF=0.056;MQ=57.44;MQ0=0;MQRankSum=0;QD=9.97;ReadPosRankSum=1.36;SOR=1.371 GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:22,0:22:54:.:.:0,54,683:.   0/0:11,0:11:30:.:.:0,30,428:.   0/0:15,0:15:42:.:.:0,42,530:.   0/0:5,0:5:15:.:.:0,15,178:.     0|1:14,6:20:99:0|1:5187_T_C:210,0,732:5187    0/0:6,0:6:10:.:.:0,10,201:.     0|0:23,1:24:27:0|1:5187_T_C:0,27,1115:5187    0/0:14,0:14:42:.:.:0,42,449:.   0/0:15,0:15:0:.:.:0,0,417:.
    
    0|0:23,1:24:27:0|1:5187_T_C:0,27,1115:5187
    

    There are examples with GT = 0|2 and PGT = 0|1

    Chr_10  2833  .    G       C,A     153.74  PASS    AC=1,1;AF=0.056,0.056;AN=18;BaseQRankSum=1.44;ClippingRankSum=1.44;DP=114;ExcessHet=3.2736;FS=3.522;MLEAC=1,1;MLEAF=0.056,0.056;MQ=39.39;MQ0=0;MQRankSum=-1.036;QD=15.37;ReadPosRankSum=0.967;SOR=3.056 GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:15,0,0:15:42:.:.:0,42,578,42,578,578:.      0/1:2,2,0:4:59:.:.:59,0,74,66,80,145:.  0|2:3,0,3:6:99:0|1:2833_G_A:117,126,252,0,126,117:2833      0/0:14,0,0:14:36:.:.:0,36,540,36,540,540:.      0/0:22,0,0:22:60:.:.:0,60,876,60,876,876:.      0/0:16,0,0:16:45:.:.:0,45,675,45,675,675:.      0/0:17,0,0:17:40:.:.:0,40,594,40,594,594:.      0/0:10,0,0:10:30:.:.:0,30,350,30,350,350:.      0/0:9,0,0:9:24:.:.:0,24,360,24,360,360:.
    
    0|2:3,0,3:6:99:0|1:2833_G_A:117,126,252,0,126,117:2833
    

    There are examples with GT = 0|3 and PGT = 0|1

    Chr_10  7782894 .       A       G,T,*   1881.92 PASS    AC=3,5,1;AF=0.167,0.278,0.056;AN=18;BaseQRankSum=0.637;ClippingRankSum=1.01;DP=162;ExcessHet=9.7754;FS=30.903;MLEAC=3,5,1;MLEAF=0.167,0.278,0.056;MQ=51.4;MQ0=0;MQRankSum=-3.923;QD=13.07;ReadPosRankSum=-0.652;SOR=2.457       GT:AD:DP:GQ:PGT:PID:PL:PS       0|1:4,3,0,0:7:99:0|1:7782857_T_G:114,0,184,126,193,319,126,193,319,319:7782857  0/0:8,0,0,0:8:24:.:.:0,24,244,24,244,244,24,244,244,244:.       1|1:0,17,0,0:17:54:1|1:7782689_A_C:810,54,0,810,54,810,810,54,810,810:7782689   0/2:14,5,7,0:26:84:.:.:255,84,655,0,381,565,294,671,587,880:.   0/2:12,0,8,5:25:99:.:.:495,521,1020,0,501,457,148,652,301,622:. 0/2:8,4,4,0:16:5:.:.:146,5,371,0,215,364,171,383,377,547:.      0|3:9,0,0,7:16:99:0|1:7782857_T_G:290,317,690,317,690,690,0,373,373,349:7782857 0/2:7,0,6,3:16:99:.:.:426,434,789,0,358,325,152,510,247,489:.   0/2:9,5,7,0:21:71:.:.:272,71,429,0,173,355,289,447,377,662:.
    
    0|3:9,0,0,7:16:99:0|1:7782857_T_G:290,317,690,317,690,690,0,373,373,349:7782857
    

    There are examples with GT = 2|2 and PGT = 1|1.

    Chr_08  694087        .       A       *,G     172.35  PASS    AC=5,2;AF=0.313,0.125;AN=16;DP=52;ExcessHet=0.1809;FS=0;MLEAC=5,2;MLEAF=0.313,0.125;MQ=45.83;MQ0=0;QD=14.36;SOR=0.836   GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:8,0,0:8:21:.:.:0,21,238,21,238,238:.        1/1:0,6,0:6:18:.:.:256,18,0,256,18,256:.        2|2:0,0,4:4:12:1|1:53694056_A_C:180,180,180,12,12,0:53694056    0/1:2,1,0:3:29:.:.:29,0,323,35,326,361:.  1/1:0,2,0:2:6:.:.:76,6,0,76,6,76:.      0/0:13,0,0:13:0:.:.:0,0,433,0,433,433:. .:0,0,0:.:.:.:.:.:.     0/0:4,0,0:4:12:.:.:0,12,80,12,80,80:.   0/0:11,0,0:11:30:.:.:0,30,372,30,372,372:.
    
    2|2:0,0,4:4:12:1|1:53694056_A_C:180,180,180,12,12,0:53694056    
    
    Chr_12  544179        .     C       T,A     692.34  PASS    AC=3,2;AF=0.167,0.111;AN=18;BaseQRankSum=1.46;ClippingRankSum=-0.61;DP=125;ExcessHet=2.1085;FS=2.099;MLEAC=3,2;MLEAF=0.167,0.111;MQ=59.14;MQ0=0;MQRankSum=0;QD=11.54;ReadPosRankSum=0.072;SOR=0.595       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:16,2,0:18:25:.:.:25,0,631,74,637,711:.      0/0:12,0,0:12:36:.:.:0,36,384,36,384,384:.        2|2:0,0,10:10:33:1|1:56544175_C_T:476,476,476,33,33,0:56544175  0/0:7,0,0:7:21:.:.:0,21,278,21,278,278:.        0/1:12,3,0:15:88:.:.:88,0,455,125,464,589:.       0/0:14,1,0:15:0:.:.:0,0,462,42,465,507:.        0/0:20,0,0:20:14:.:.:0,14,650,14,650,650:.      0/0:10,0,0:10:0:.:.:0,0,261,0,261,261:. 0/1:12,5,0:17:99:.:.:136,0,430,172,444,617:.
    
    2|2:0,0,10:10:33:1|1:56544175_C_T:476,476,476,33,33,0:56544175
    

    The c.a. 287K variants with a phasing issue kind of split out like following.
    I did not account for variants where there are multiple different phasing issue.
    I just wrote them to into the first category that I checked for in the code.
    wc -l *vcf

    * 287.800 hom_gt_het_phase_issue.vcf  (=total)
    * 25.727 hom_gt_het_phase_issue_het02_to_het01.vcf
    * 8.954 hom_gt_het_phase_issue_hom22_to_hom11.vcf
    * 257.692 hom_gt_het_phase_issue_homref_to_het.vcf
    * 423 hom_gt_het_phase_issue_other.vcf
    

    So most variants with a phasing issue have GT=0|0 and PGT=0|1.

    I hope this information is useful.

  • WimSWimS Member ✭✭

    Here is the CYVCF2 code that I used to get the above examples and numbers.

    The most important bit is:

    if gt_allele_set == {0} and pgt_allele_set == {0,1}:
        variant_with_phase_homref_to_het_issue = True
    if gt_allele_set == {2} and pgt_allele_set == {1}:
        variant_with_phase_hom22_to_hom11_issue = True
    elif gt_allele_set == {0,2} and pgt_allele_set == {0,1}:
        variant_with_phase_het02_to_het01_issue = True
    

    You could run the same or similar code on your internal data to see if you can reproduce the issue/find internal examples.

    from cyvcf2 import VCF, Writer
    
    path = "/DA_1458/VSDA_1458-gatk-haplotype-joint-annotated.bcf"
    output_path_hom_ref_to_het = "/DA_1458/hom_gt_het_phase_issue/hom_gt_het_phase_issue_homref_to_het.vcf"
    output_path_hom22_to_hom11 = "/DA_1458/hom_gt_het_phase_issue/hom_gt_het_phase_issue_hom22_to_hom11.vcf"
    output_path_het02_to_het01 = "/DA_1458/hom_gt_het_phase_issue/hom_gt_het_phase_issue_het02_to_het01.vcf"
    output_path_other = "/DA_1458/hom_gt_het_phase_issue/hom_gt_het_phase_issue_other.vcf"
    
    writer_hom_ref_to_het = Writer(output_path_hom_ref_to_het, VCF(path))
    writer_hom22_to_hom11 = Writer(output_path_hom22_to_hom11, VCF(path))
    writer_het02_to_het_01 = Writer(output_path_het02_to_het01, VCF(path))
    writer_other = Writer(output_path_other, VCF(path))
    
    
    writer_hom_ref_to_het.write_header()
    writer_hom22_to_hom11.write_header()
    writer_other.write_header()
    
    
    count_gt_unequal_pgt = 0
    count_variant_with_phased_gt = 0
    count_variant_with_gt_unequal_pgt = 0
    
    for variant in VCF(path):
    
        variant_with_phase_issue = False
        variant_with_phase_homref_to_het_issue = False
        variant_with_phase_hom22_to_hom11_issue = False
        variant_with_phase_het02_to_het01_issue = False
    
        if 'PGT' in variant.FORMAT:
    
            count_variant_with_phased_gt += 1
    
            phased_genotypes = variant.format('PGT')
    
            gt_count = 0
    
            for genotype in variant.genotypes:
    
                # If the genotype is called and phased
                if genotype[0] != -1 and genotype[2]:
    
                    # Get the allele set from the GT and PGT field
                    gt_allele_set = set(genotype[0:2])
                    pgt_allele_set = set(int(gt) for gt in phased_genotypes[gt_count].split("|"))
    
                    if gt_allele_set != pgt_allele_set:
    
                        variant_with_phase_issue = True
                        count_gt_unequal_pgt += 1
    
                        if gt_allele_set == {0} and pgt_allele_set == {0,1}:
                            variant_with_phase_homref_to_het_issue = True
                        elif gt_allele_set == {2} and pgt_allele_set == {1}:
                            variant_with_phase_hom22_to_hom11_issue = True
                        elif gt_allele_set == {0,2} and pgt_allele_set == {0,1}:
                            variant_with_phase_het02_to_het01_issue = True
    
    
    
                # increment the counter to see at which gt we are
                gt_count += 1
    
        if variant_with_phase_issue:
            if not variant_with_phase_homref_to_het_issue and not variant_with_phase_hom22_to_hom11_issue and not variant_with_phase_het02_to_het01_issue:
                writer_other.write_record(variant)
    
            if variant_with_phase_homref_to_het_issue:
                writer_hom_ref_to_het.write_record(variant)
            if variant_with_phase_hom22_to_hom11_issue:
                writer_hom22_to_hom11.write_record(variant)
            if variant_with_phase_het02_to_het01_issue:
                writer_het02_to_het_01.write_record(variant)
    
            count_variant_with_gt_unequal_pgt += 1
    
    
    print(count_variant_with_phased_gt)
    print("\n")
    print(count_variant_with_gt_unequal_pgt)
    print("\n")
    print(count_gt_unequal_pgt)
    print("\n")
    print("\n")
    
  • WimSWimS Member ✭✭

    Hi @bhanuGandham can you please check with the developers if the issue that I am having is indeed related to the issue and the pull request hat I mention in the above comment?

    And could you also ask what the progress is on this issue.

    We were planning to start using the phased genotype information.
    But now I am not sure if the phased genotype information is reliable enough to be used.
    Would you (and the GATK developers) recommend that users use the phased genotype information? Or that they ignore this data at this time? Until the linked issue is fixed?

  • WimSWimS Member ✭✭
    edited October 18

    @droazen
    Thank you and ldgauthier for picking up the ticket on github.
    Asking here an extra follow up question because I I temporary don't have access to my github account until after the weekend.

    If PGT should always be heterozygous. Why do I find these cases where the GT is 2|2 and the PGT is 1|1?
    And how should the PGT field be interpreted if it does not contain the actual alleles from the GT?

    I find almost 9 thousand variants with GT=2|2 to PGT=1|1 out of the 287K variants that have the GT to PGT discordance.

    • 287.800 hom_gt_het_phase_issue.vcf (=total)
    • 25.727 hom_gt_het_phase_issue_het02_to_het01.vcf
    • 8.954 hom_gt_het_phase_issue_hom22_to_hom11.vcf
    • 257.692 hom_gt_het_phase_issue_homref_to_het.vcf
    • 423 hom_gt_het_phase_issue_other.vcf

    See these 2 examples

    Chr_08  694087        .       A       *,G     172.35  PASS    AC=5,2;AF=0.313,0.125;AN=16;DP=52;ExcessHet=0.1809;FS=0;MLEAC=5,2;MLEAF=0.313,0.125;MQ=45.83;MQ0=0;QD=14.36;SOR=0.836   GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:8,0,0:8:21:.:.:0,21,238,21,238,238:.        1/1:0,6,0:6:18:.:.:256,18,0,256,18,256:.        2|2:0,0,4:4:12:1|1:53694056_A_C:180,180,180,12,12,0:53694056    0/1:2,1,0:3:29:.:.:29,0,323,35,326,361:.  1/1:0,2,0:2:6:.:.:76,6,0,76,6,76:.      0/0:13,0,0:13:0:.:.:0,0,433,0,433,433:. .:0,0,0:.:.:.:.:.:.     0/0:4,0,0:4:12:.:.:0,12,80,12,80,80:.   0/0:11,0,0:11:30:.:.:0,30,372,30,372,372:.
    
    2|2:0,0,4:4:12:1|1:53694056_A_C:180,180,180,12,12,0:53694056    
    
    Chr_12  544179        .     C       T,A     692.34  PASS    AC=3,2;AF=0.167,0.111;AN=18;BaseQRankSum=1.46;ClippingRankSum=-0.61;DP=125;ExcessHet=2.1085;FS=2.099;MLEAC=3,2;MLEAF=0.167,0.111;MQ=59.14;MQ0=0;MQRankSum=0;QD=11.54;ReadPosRankSum=0.072;SOR=0.595       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:16,2,0:18:25:.:.:25,0,631,74,637,711:.      0/0:12,0,0:12:36:.:.:0,36,384,36,384,384:.        2|2:0,0,10:10:33:1|1:56544175_C_T:476,476,476,33,33,0:56544175  0/0:7,0,0:7:21:.:.:0,21,278,21,278,278:.        0/1:12,3,0:15:88:.:.:88,0,455,125,464,589:.       0/0:14,1,0:15:0:.:.:0,0,462,42,465,507:.        0/0:20,0,0:20:14:.:.:0,14,650,14,650,650:.      0/0:10,0,0:10:0:.:.:0,0,261,0,261,261:. 0/1:12,5,0:17:99:.:.:136,0,430,172,444,617:.
    
    2|2:0,0,10:10:33:1|1:56544175_C_T:476,476,476,33,33,0:56544175
    
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    This question is being handled in this issue ticket: https://github.com/broadinstitute/gatk/issues/6220

Sign In or Register to comment.