The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# GATK: basic VCF indel output question

Posts: 4Member
edited May 2013

Hi,

I'm having problems understanding a GATK output VCF. I have read the VCF standard, but I'm obviously missing something.

I /think/ I understand how SNPs and short indels are represented, but clearly I do not. Below is an excerpt that illustrates sites which I do not understand. I suspect it may be something to do with GATK quality filters that I'm not understanding, or something about using EMIT_ALL_SITES...

The excerpt below was generated using

GATK -l INFO -I my.bam -R my.fa -T UnifiedGenotyper -S LENIENT -nt 8 --heterozygosity 0.1 -o test.vcf --genotype_likelihoods_model BOTH --min_base_quality_score 10 --output_mode EMIT_ALL_SITES -ploidy 2

Thanks!

Darren

    CH1 225 .   T   G   12.71   LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.978;DP=59;Dels=0.03;FS=0.000;HaplotypeScore=10.2840;MLEAC=1;MLEAF=0.500;MQ=70.25;MQ0=8;MQRankSum=-5.349;QD=0.22;ReadPosRankSum=-3.188 GT:AD:DP:GQ:PL  0/1:41,16:55:20:20,0,1435
CH1 226 .   T   .   121.53  .   AN=2;DP=59;MQ=70.25;MQ0=8   GT:DP   0/0:43
CH1 227 .   A   .   121.53  .   AN=2;DP=59;MQ=70.25;MQ0=8   GT:DP   0/0:43
CH1 228 .   T   .   121.53  .   AN=2;DP=59;MQ=70.25;MQ0=8   GT:DP   0/0:43
CH1 229 .   A   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:38
CH1 230 .   C   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:38
CH1 231 .   T   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:36
CH1 232 .   G   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:36
CH1 233 .   C   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:37
CH1 234 .   A   .   139.53  .   AN=2;DP=70;MQ=59.20;MQ0=14  GT:DP   0/0:63
CH1 235 .   A   .   175.53  .   AN=2;DP=84;MQ=51.67;MQ0=15  GT:DP   0/0:79
CH1 236 .   A   .   175.53  .   AN=2;DP=84;MQ=51.67;MQ0=15  GT:DP   0/0:79
CH1 237 .   T   .   175.53  .   AN=2;DP=85;MQ=51.37;MQ0=16  GT:DP   0/0:80
CH1 238 .   A   .   175.53  .   AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP   0/0:97
CH1 239 .   A   .   175.53  .   AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP   0/0:101
CH1 240 .   T   .   175.53  .   AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP   0/0:98
CH1 241 .   A   .   169.53  .   AN=2;DP=108;MQ=44.14;MQ0=29 GT:DP   0/0:107
*    CH1    242 .   T   .   169.53  .   AN=2;DP=109;MQ=43.94;MQ0=29 GT:DP   0/0:103
*    CH1    242 .   T   .   118.27  .   AN=2;DP=109;MQ=43.94;MQ0=29 GT:AD:DP    0/0:27:55
*    CH1    243 .   C   .   172.53  .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:108
*    CH1    243 .   CTTTT   .   118.27  .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:AD:DP    0/0:27:56
CH1 244 .   T   .   91.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:61
CH1 245 .   T   .   91.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:53
CH1 246 .   T   .   73.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:41
CH1 247 .   T   .   91.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:46
CH1 248 .   A   .   172.53  .   AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP   0/0:100
CH1 249 .   A   .   172.53  .   AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP   0/0:100
CH1 250 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:101
CH1 251 .   T   .   169.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:96
CH1 251 .   T   .   118.27  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:AD:DP    0/0:27:56
CH1 252 .   C   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:113
CH1 253 .   C   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:110
CH1 254 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:111
CH1 255 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:111
CH1 256 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:111


Line 1 is a SNP
Lines 14 and 15 are an indel that I do understand
Lines 19 and 20 I do /not/ understand
Lines 21 and 22 I do /not/ understand

Post edited by Carneiro on
Tagged:

• Posts: 4Member

OK, clearly that didn't work. The excerpt is now attached as a file. - apologies!

sorry I had to edit your post to understand what you were asking. Let me rephrase your question here to make sure I understand it now:

You want to know why the lines marked with "*" have duplicated information for the same position ? (the actual event is clear -- a T deletion, or a CTTTTT deletion -- it's the fact that you have multiple events in the same location that seems to be wrong here).

Is this correct?

• Posts: 4Member

Hi!

Thank you for editing the question! You have my incomprehension encapsulated perfectly - and perhaps I should have been able to work it out for myself.

However! I still do not fully understand:

If

• CH1 242 . T . 169.53 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:DP 0/0:103
• CH1 242 . T . 118.27 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:AD:DP 0/0:27:55 - See more at: http://gatkforums.broadinstitute.org

represents a polymorphic indel (T/.) at position 242 (which I should have been able to work out for myself), then I have two supplementary questions:

(1) Why is the genotype field '0/0' rather than '0/1' or '1/1' ?

Since this is the only (diploid) individual in the dataset, for there to be an indel recorded at this site, the indel /must/ be heterozygous or homozygous deletion, as there is no external indel information for these data.

(2) Why is the AD field '27' rather than something like '20,7'? This entry suggests to me that the 'T' call was seen 27 times, and the call seen zero times.

But if the deletion was seen zero times, how come it's there at all as an ALT allele? My first guess is that I don't understand a quality issue (population DP is 109, whereas AD is 27). But there were clearly 55 quality calls of the deleted allele, since individual-level DP=55.

Thanks again for any advice you can give me! it's much appreciated!

Darren

• Posts: 4Member

Great! Thank you for all your help.