Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

PGT and PID is a dot

Hi,
I am following the best practice pipeline with version 3.6 of gatk and in order to reduce the amount of compound heteozygote variants found in my analysis, I recently chose to "mature" into using the phased gt annotations PID and PGT and I have very nice results.
However I found a high amount of variants with a dot ('.') in both fields in the vcfs I started to use the new analysis on.
I scanned around in the forums and documentation, however I could not find any sign what that means.
this is especially interesting, as I have several variants where the variants are on the same read only, however only a few of the variants are phased, the others show the '.'
I am pleased the with the outcome already however I fear that I might miss information.

I included the screenshot as well as an overview of the variants in that region.

I would appreciate it if you could shed some light on this issue, even if it is just " the dot means we were not sure about it"

chr1    152280671   .   A   C   3783.47 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=5.09;ClippingRankSum=0;DP=26522;ExcessHet=10.4471;FS=22.852;InbreedingCoeff=-0.0643;MLEAC=7;MLEAF=0.019;MQ=58.86;MQRankSum=-7.267;QD=1.96;ReadPosRankSum=-1.448;SOR=2.749;VQSLOD=-14.1;culprit=MQRankSum    GT:AD:DP:GQ:PGT:PID:PL  0/1:277,34:311:99:0|1:152280669_T_C:585,0,11704
chr1    152280685   .   C   A   9358.84 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=4.79;ClippingRankSum=0;DP=35550;ExcessHet=33.4707;FS=9.592;InbreedingCoeff=-0.1314;MLEAC=11;MLEAF=0.03;MQ=58.82;MQRankSum=-7.86;QD=3.99;ReadPosRankSum=-2.019;SOR=1.604;VQSLOD=-8.457;culprit=MQRankSum GT:AD:DP:GQ:PGT:PID:PL  0/1:299,66:365:99:0|1:152280669_T_C:1871,0,12440
chr1    152280687   .   C   T   9105.01 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=5.67;ClippingRankSum=0;DP=35912;ExcessHet=38.6724;FS=9.592;InbreedingCoeff=-0.1438;MLEAC=12;MLEAF=0.033;MQ=58.81;MQRankSum=-7.86;QD=3.89;ReadPosRankSum=-1.766;SOR=1.605;VQSLOD=-8.451;culprit=MQRankSum    GT:AD:DP:GQ:PGT:PID:PL  0/1:299,66:365:99:0|1:152280669_T_C:1840,0,13062
chr1    152280688   .   A   G   9486.72 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=9.54;ClippingRankSum=0;DP=36233;ExcessHet=39.4426;FS=9.592;InbreedingCoeff=-0.1455;MLEAC=8;MLEAF=0.022;MQ=58.83;MQRankSum=-7.765;QD=4.02;ReadPosRankSum=-1.182;SOR=1.63;VQSLOD=-8.617;culprit=MQRankSum GT:AD:DP:GQ:PGT:PID:PL  0/1:303,66:371:99:.:.:1926,0,13029
chr1    152280691   .   A   T   11612.7 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.025;AN=2;BaseQRankSum=6.18;ClippingRankSum=0;DP=38167;ExcessHet=38.2356;FS=4.867;InbreedingCoeff=-0.1452;MLEAC=13;MLEAF=0.036;MQ=58.89;MQRankSum=-10.18;QD=3.99;ReadPosRankSum=-1.247;SOR=1.173;VQSLOD=-9.026;culprit=MQRankSum   GT:AD:DP:GQ:PGT:PID:PL  0/1:281,110:391:99:.:.:3611,0,11146

Tagged:

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited January 2018

    Hi @SHollizeck,

    Any . field means uncertainty or lack of information. I suspect this arises for your string of variants (that should phase together) because they span beyond the kmer lengths (default 10 and 25) used in reassembly. Here I am making some assumptions about your data, based on the region of the IGV screenshot in view.

    • Variants in seeming phase based on read evidence (in the IGV screenshot) are A-CTC-ATGT over the span of 27 bases (152,228,665-152,228,691 inclusive).
    • Instead, the PID 152280669_T_C grouping shows only CTC is phased together and the A- and -ATGT are left out. Furthermore, the -ATGT grouping, despite their proximity to one another, do not appear to phase in a separate group. It seems the tool is aware there is a possibility that these SNPs phase together and so instead of omitting PGT:PID for records where phasing does not apply, it gives .:. values to them.

    You are using GATK v3.6.

    Here are some suggestions.

    • Try using a larger kmer size for the region in HaplotypeCaller
    • Try to see if ReadBackedPhasing improves phasing these sites
    • Try a later version of GATK. I hear HaplotypeCaller has finally been tied out for GATK4, which is set to officially release next week. You will have to adopt new syntax but it will be worth it for the performance improvements GATK4 brings. You have two options here. Either you will have to build your own jar by following instructions at https://github.com/broadinstitute/gatk, or you can use the nightly Docker at https://hub.docker.com/r/broadinstitute/gatk-nightly/. Instructions for getting started on Docker are in Article#10870.

    Finally, I think your case is of interest to our developers, towards improving our tools. Would you mind submitting some of your data so we can recapitulate your observations? A snippet of the BAM to produce the variant records in your IGV screenshot would be much appreciated. Directions for submitting data are in Article#1894.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    @SHollizeck in order to implement physical phasing in HaplotypeCaller in an efficient way, we made some assumptions. We output variants as being in cis phase if they are present together on all of the assembled haplotypes. We output variants as being in trans phase if they are never present together on any assembled haplotype. By default we assemble up to 128 haplotypes. It's hard for me to say exactly what's going on without seeing the full set of reads for this region, but given that I can see a stray T at 152,280,691 it seems likely that there is a haplotype with that T that doesn't contain the previous SNPs, preventing it from being phased. You can see the assembled haplotypes using the -bamout argument to see if my hypothesis is true. In my experience debugging missed MNP phasing, this is usually the case. If that is the case, you may be able to fix the phasing by increasing the -minPruning argument such that paths in the graph that contain a variant that doesn't phase perfectly are pruned out. I believe that ReadBackedPhasing has a more probabilistic treatment of phasing, which could help if your data is noisy. The phasing code has not been changed since GATK 3.4 so I expect the same results from the latest GATK4.

    In terms of the format of the output, I haven't seen .:. in the PGT:PID fields for a single sample VCF. Was this subset from a multi-sample VCF? Usually in the FORMAT field the dot is indicative of the fact that another sample had that annotation, but it is missing for the sample in question.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks @gauthier for clarifying the .:. could be from splitting a cohort VCF into a single-sample VCF. That makes more sense as a possibility.

  • Thank you very much for all the answers!
    It is indeed a vcf split after joined genotyping to be annotated with vep afterwards.
    I will test the bamout as well as minPruning and will report the results.

Sign In or Register to comment.