It looks like you're new here. If you want to get involved, click one of these buttons!
Hi,
I'm having problems understanding a GATK output VCF. I have read the VCF standard, but I'm obviously missing something.
I /think/ I understand how SNPs and short indels are represented, but clearly I do not. Below is an excerpt that illustrates sites which I do not understand. I suspect it may be something to do with GATK quality filters that I'm not understanding, or something about using EMIT_ALL_SITES...
The excerpt below was generated using
GATK -l INFO -I my.bam -R my.fa -T UnifiedGenotyper -S LENIENT -nt 8 --heterozygosity 0.1 -o test.vcf --genotype_likelihoods_model BOTH --min_base_quality_score 10 --output_mode EMIT_ALL_SITES -ploidy 2
Thanks!
Darren
CH1 225 . T G 12.71 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.978;DP=59;Dels=0.03;FS=0.000;HaplotypeScore=10.2840;MLEAC=1;MLEAF=0.500;MQ=70.25;MQ0=8;MQRankSum=-5.349;QD=0.22;ReadPosRankSum=-3.188 GT:AD:DP:GQ:PL 0/1:41,16:55:20:20,0,1435 CH1 226 . T . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43 CH1 227 . A . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43 CH1 228 . T . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43 CH1 229 . A . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:38 CH1 230 . C . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:38 CH1 231 . T . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:36 CH1 232 . G . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:36 CH1 233 . C . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:37 CH1 234 . A . 139.53 . AN=2;DP=70;MQ=59.20;MQ0=14 GT:DP 0/0:63 CH1 235 . A . 175.53 . AN=2;DP=84;MQ=51.67;MQ0=15 GT:DP 0/0:79 CH1 236 . A . 175.53 . AN=2;DP=84;MQ=51.67;MQ0=15 GT:DP 0/0:79 CH1 237 . T . 175.53 . AN=2;DP=85;MQ=51.37;MQ0=16 GT:DP 0/0:80 CH1 238 . A . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:97 CH1 238 . A AGAAAGAAAGCTTGTA 83.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.172;DP=102;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=46.90;MQ0=0;MQRankSum=-6.190;QD=0.05;ReadPosRankSum=-5.733 GT:AD:DP:GQ:PL 0/1:27,25:57:99:121,0,4853 CH1 239 . A . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:101 CH1 240 . T . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:98 CH1 241 . A . 169.53 . AN=2;DP=108;MQ=44.14;MQ0=29 GT:DP 0/0:107 * CH1 242 . T . 169.53 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:DP 0/0:103 * CH1 242 . T . 118.27 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:AD:DP 0/0:27:55 * CH1 243 . C . 172.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:108 * CH1 243 . CTTTT . 118.27 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:AD:DP 0/0:27:56 CH1 244 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:61 CH1 245 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:53 CH1 246 . T . 73.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:41 CH1 247 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:46 CH1 248 . A . 172.53 . AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP 0/0:100 CH1 249 . A . 172.53 . AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP 0/0:100 CH1 250 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:101 CH1 251 . T . 169.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:96 CH1 251 . T . 118.27 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:AD:DP 0/0:27:56 CH1 252 . C . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:113 CH1 253 . C . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:110 CH1 254 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111 CH1 255 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111 CH1 256 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111
Line 1 is a SNP
Lines 14 and 15 are an indel that I do understand
Lines 19 and 20 I do /not/ understand
Lines 21 and 22 I do /not/ understand
Ah, okay, you have EMIT_ALL_SITES on.
this means that if the SNP mode finds a mutation but the INDEL mode doesn't, they will both be reported, one with the genotype and the other one without. In this case, both modes are finding REFERENCE (which is represented by 0/0) and are reporting it. In the other cases, it just happens that there is no other haplotypes supporting indels (probably) and only the SNP model is reporting the reference site.
the only variants you have in this data are #1 (SNP) and #15 (INSERTION). Everything else is reference, and the multiple calls on the same location are an artifact of using EMIT_ALL_SITES and GLM_BOTH. This is explained in the documentation of the Unified Genotyper.
Answers
OK, clearly that didn't work. The excerpt is now attached as a file. - apologies!
sorry I had to edit your post to understand what you were asking. Let me rephrase your question here to make sure I understand it now:
You want to know why the lines marked with "*" have duplicated information for the same position ? (the actual event is clear -- a T deletion, or a CTTTTT deletion -- it's the fact that you have multiple events in the same location that seems to be wrong here).
Is this correct?
Hi!
Thank you for editing the question! You have my incomprehension encapsulated perfectly - and perhaps I should have been able to work it out for myself.
However! I still do not fully understand:
If
represents a polymorphic indel (T/.) at position 242 (which I should have been able to work out for myself), then I have two supplementary questions:
(1) Why is the genotype field '0/0' rather than '0/1' or '1/1' ?
Since this is the only (diploid) individual in the dataset, for there to be an indel recorded at this site, the indel /must/ be heterozygous or homozygous deletion, as there is no external indel information for these data.
(2) Why is the AD field '27' rather than something like '20,7'? This entry suggests to me that the 'T' call was seen 27 times, and the call seen zero times.
But if the deletion was seen zero times, how come it's there at all as an ALT allele? My first guess is that I don't understand a quality issue (population DP is 109, whereas AD is 27). But there were clearly 55 quality calls of the deleted allele, since individual-level DP=55.
Thanks again for any advice you can give me! it's much appreciated!
Darren
Ah, okay, you have EMIT_ALL_SITES on.
this means that if the SNP mode finds a mutation but the INDEL mode doesn't, they will both be reported, one with the genotype and the other one without. In this case, both modes are finding REFERENCE (which is represented by 0/0) and are reporting it. In the other cases, it just happens that there is no other haplotypes supporting indels (probably) and only the SNP model is reporting the reference site.
the only variants you have in this data are #1 (SNP) and #15 (INSERTION). Everything else is reference, and the multiple calls on the same location are an artifact of using EMIT_ALL_SITES and GLM_BOTH. This is explained in the documentation of the Unified Genotyper.
Great! Thank you for all your help.