We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

IS GT being reported correctly in positions spanned by a deletion following joint-genotyping?

SteveLSteveL BarcelonaMember ✭✭
edited December 2015 in Ask the GATK team

Hi team,

I am looking at an ~500 sample VCF that was produced as follows:

  1. WES-> 10 band gVCF (GATK 3.3)
  2. Combine gVCFs in batches of ~100 (GAT3.4-46)
  3. Genotype GVCfs on the output of 2 --> Using -L to restrict to certain regions and --allsites to force calls for all positions (GATK 3.5)

Unfortunately I had to use different versions of the toolkit as the original files were produced a few moths ago, and then with the combine-gVCF bug, I had to redo the joint-genotyping with 3-4-46 (to get the asterisks), and then used 3.5 for the last step. I am not sure if this may explain my observations in part, but I don't think so - if you do, then I will happily try to repeat the process for just the small section I describe, but I don't want to go to that effort if there is a good explanation, or it is a known issue.

I have been looking at some parts of the VCF where one sample has a largish deletion, and I am not sure if the GT is being reported correctly. In order not to post redundant clutter, below I have pasted just the relevant information for a deletion and the four immediately subsequent position, for two typical samples (that look more or less the same as the other 500), and then the sample that has the deletion. I have a couple of questions:

1) The first positions is fine, nice clean heterozygous deletion, and GT is 0/1 - not also AC=1. In the next position (1:899910), we have two observed alt-calls. However, the T obviously didn't reach threshold in the sample in which it was observed - I think it was 2 out of 46 reads, so this position gets treated as homo-ref across all samples - hence AC=0,0 (and the RGQ annotation) - but here the deletion is also called "0/0" - for me it would make more sense to be "0/1" - the 0 being ref on Chr1-copy1 and the 1 being the deletion, as indicated by the asterisk, on Chr1-copy2. In truth the actual call is just a haploid "0", but I understand why we don't want to mix diploid and haploid calls across a record. But maybe there is a another reason for this to be 0/0 - or is it because I am using bands in the gVCF but forcing all-sites in genotypeGVCF?

2) Moving on to position 1:899911 we have two calls again, but this time the T is seen as heterozgote in one of the other samples. But her the the deletion allele seems to be getting counted as if it was a real alt - it has a GT of 0/2, and AC=1,1. This seems inconsistent with what is observed in (1) above i.e. either it should be "0/1" in one, as I find more intuitive, or it should be 0/0 at both 899910. Can you confirm if this is expected behaviour? The final two positions are just simple positions where all samples are 0/0 apart from the sample that has the deletion - still with a GT of 0/0.

I have pasted both raw, and as code-block, which are both a bit tricky to read. Thanks for your help as always!

1 899909 . AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG A 705.33 . AC=1;AF=8.865e-04;AN=1128;BaseQRankSum=0.551;ClippingRankSum=1.50;DP=35503;FS=3.102;InbreedingCoeff=-0.0009;LikelihoodRankSum=-1.141e+00;MLEAC=1;MLEAF=8.865e-04;MQ=60.00;MQ0=0;MQRankSum=0.718;QD=9.16;ReadPosRankSum=-1.363e+00;SOR=1.265 GT:AD:DP:GQ:PGT:PID:PL 0/0:63,0:63:99:.:.:0,120,1800 0/0:46,0:46:99:.:.:0,101,1403 0/1:54,23:77:99:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:754,0,2184
1 899910 . G *,T . . AC=0,0;AF=0.00,0.00;AN=1128;BaseQRankSum=-1.804e+00;ClippingRankSum=0.189;DP=35503;LikelihoodRankSum=-8.100e-02;MQ=60.00;MQ0=0;MQRankSum=-3.500e-01;ReadPosRankSum=0.296 GT:AD:DP:PGT:PID:RGQ 0/0:63,0,0:63:.:.:99 0/0:46,0,0:46:.:.:99 0/0:54,23,0:77:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:99
1 899911 . G T,* 4648.83 . AC=1,1;AF=8.865e-04,8.865e-04;AN=1128;BaseQRankSum=-1.700e-01;ClippingRankSum=-1.150e-01;DP=35622;FS=28.957;InbreedingCoeff=-0.0018;LikelihoodRankSum=-4.610e-01;MLEAC=1,1;MLEAF=8.865e-04,8.865e-04;MQ=60.00;MQ0=0;MQRankSum=1.27;QD=18.52;ReadPosRankSum=2.37;SOR=2.569 GT:AD:DP:GQ:PGT:PID:PL 0/0:63,0,0:63:99:.:.:0,120,1800,120,1800,1800 0/0:46,0,0:46:99:.:.:0,101,1403,101,1403,1403 0/2:54,0,23:77:99:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:754,917,3184,0,2267,2184
1 899912 . T * . . AC=0;AF=0.00;AN=1128;DP=35696 GT:AD:DP:PGT:PID:RGQ 0/0:63,0:63:.:.:99 0/0:46,0:46:.:.:99 0/0:54,23:77:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:99
1 899913 . C * . . AC=0;AF=0.00;AN=1128;DP=35697 GT:AD:DP:PGT:PID:RGQ 0/0:63,0:63:.:.:99 0/0:46,0:46:.:.:99 0/0:54,23:77:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:99

1 899909 . AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG A 705.33 . AC=1;AF=8.865e-04;AN=1128;BaseQRankSum=0.551;ClippingRankSum=1.50;DP=35503;FS=3.102;InbreedingCoeff=-0.0009;LikelihoodRankSum=-1.141e+00;MLEAC=1;MLEAF=8.865e-04;MQ=60.00;MQ0=0;MQRankSum=0.718;QD=9.16;ReadPosRankSum=-1.363e+00;SOR=1.265 GT:AD:DP:GQ:PGT:PID:PL 0/0:63,0:63:99:.:.:0,120,1800 0/0:46,0:46:99:.:.:0,101,1403 0/1:54,23:77:99:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:754,0,2184 1 899910 . G *,T . . AC=0,0;AF=0.00,0.00;AN=1128;BaseQRankSum=-1.804e+00;ClippingRankSum=0.189;DP=35503;LikelihoodRankSum=-8.100e-02;MQ=60.00;MQ0=0;MQRankSum=-3.500e-01;ReadPosRankSum=0.296 GT:AD:DP:PGT:PID:RGQ 0/0:63,0,0:63:.:.:99 0/0:46,0,0:46:.:.:99 0/0:54,23,0:77:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:99 1 899911 . G T,* 4648.83 . AC=1,1;AF=8.865e-04,8.865e-04;AN=1128;BaseQRankSum=-1.700e-01;ClippingRankSum=-1.150e-01;DP=35622;FS=28.957;InbreedingCoeff=-0.0018;LikelihoodRankSum=-4.610e-01;MLEAC=1,1;MLEAF=8.865e-04,8.865e-04;MQ=60.00;MQ0=0;MQRankSum=1.27;QD=18.52;ReadPosRankSum=2.37;SOR=2.569 GT:AD:DP:GQ:PGT:PID:PL 0/0:63,0,0:63:99:.:.:0,120,1800,120,1800,1800 0/0:46,0,0:46:99:.:.:0,101,1403,101,1403,1403 0/2:54,0,23:77:99:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:754,917,3184,0,2267,2184 1 899912 . T * . . AC=0;AF=0.00;AN=1128;DP=35696 GT:AD:DP:PGT:PID:RGQ 0/0:63,0:63:.:.:99 0/0:46,0:46:.:.:99 0/0:54,23:77:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:99 1 899913 . C * . . AC=0;AF=0.00;AN=1128;DP=35697 GT:AD:DP:PGT:PID:RGQ 0/0:63,0:63:.:.:99 0/0:46,0:46:.:.:99 0/0:54,23:77:0|1:899909_AGGTCCGCAGTGGGGCTGCGGGGAGGGGGGCGCG_A:99

Issue · Github
by Sheila

Issue Number
425
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, we generally recommend against mixing versions within a pipeline. This is generally safe enough to do if you use one version for the data processing segment and one for the variant discovery segment, but splitting up the tools involved in variant discovery across versions is a much more dicey proposition because they are very tightly integrated, and usually a change in one component is accompanied by a change in the others.

    In this case, because what you're observing has to do with a spanning deletion, I would indeed suspect that the version mixing is responsible. I don't have the time right now to look at this in much detail but I would recommend testing that hypothesis by running the full discovery workflow (from the analysis-read bam) on the region surrounding the deletion. It shouldn't take very long and if the results look more sane then you'll know.

Sign In or Register to comment.