Any statistic to measure the "degree" of heterozygosity per site?
I have a somewhat strange question and would like some opinions on whether this makes sense. I've followed the Best Practices guide and am now at the stage of looking at my SNP VCF file and interpreting the data. I've sequenced a non-model Drosophila; specifically, I pooled a couple of progeny from a single line and sequenced the genome. I think this pooling is coming back to bite me, since the het sites I'm getting don't exactly follow the traditional definitions of homozygous (0%/100% alt reads) and heterozygous (50% alt reads) variation.
For example, the het site in the line below follows the traditional definition of heterozygosity (50% ref and 50% alt; the 0/1 sample has AD=16,16):
genome 1428318 . G A 392.68 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=0.207;ClippingRankSum=-8.860e-01;DP=259;FS=3.192;MLEAC=1;MLEAF=0.071;MQ=60.00;MQ0=0;MQRankSum=0.697;QD=12.27;ReadPosRankSum=1.11 GT:AD:DP:GQ:PL 0/0:.:32:64:0,64,1395 0/0:.:43:99:0,120,1800 0/0:.:23:62:0,62,990 0/1:16,16:32:99:426,0,422 0/0:.:28:67:0,67,1080 0/0:.:56:99:0,99,1800 0/0:.:45:99:0,120,1800
But a site like this one doesn't look like a traditional het site (the 0/1 sample has AD=81,23, so only about 22% of reads carry the alt allele):
genome 1430207 . C T 608.68 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=1.57;ClippingRankSum=-2.430e-01;DP=322;FS=1.962;MLEAC=1;MLEAF=0.071;MQ=60.00;MQ0=0;MQRankSum=0.172;QD=5.85;ReadPosRankSum=0.799 GT:AD:DP:GQ:PL 0/0:.:21:60:0,60,900 0/1:81,23:104:99:642,0,2872 0/0:.:40:62:0,62,1485 0/0:.:28:66:0,66,990 0/0:.:45:71:0,71,1620 0/0:.:24:64:0,64,990 0/0:.:60:92:0,92,1800
So is there any way to measure this difference? One idea I had was to take, for every het site, the ratio of the values reported in the PL field as a statistic for the degree of heterozygosity per site. Specifically, the ratio of the PL for the 0/0 genotype to the PL for the 1/1 genotype would be ~1 if there are 50% ref and 50% alt reads, whereas a deviation from 1 would indicate otherwise. I was planning to plot these ratios and see whether they cluster around 1 or look different. Would this be a wrong way of using the PL field, and are there any other ways to measure the so-called "degree" of heterozygosity per site? I'm sorry if this sounds confusing.
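To make the idea concrete, here is a minimal sketch of the calculation, applied to the two het samples from the records above. It computes the PL(0/0) / PL(1/1) ratio described in the question and, for comparison, the alt read fraction taken directly from the AD field. The function names `parse_sample` and `het_stats` are hypothetical helpers written for this illustration, not part of GATK or any VCF library, and the sketch assumes biallelic sites with exactly three PL values.

```python
def parse_sample(format_field, sample_field):
    """Map FORMAT keys (e.g. GT, AD, PL) to one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def het_stats(format_field, sample_field):
    """Return alt read fraction and PL ratio for a het call, else None.

    Assumes a biallelic site: AD is "ref,alt" and PL is "0/0,0/1,1/1".
    """
    sample = parse_sample(format_field, sample_field)
    if sample.get("GT") not in ("0/1", "0|1", "1|0"):
        return None  # only het genotypes are of interest here

    ref_reads, alt_reads = (int(x) for x in sample["AD"].split(","))
    pl_hom_ref, _pl_het, pl_hom_alt = (int(x) for x in sample["PL"].split(","))

    # Guard against division by zero (no informative reads, or PL(1/1) == 0).
    if ref_reads + alt_reads == 0 or pl_hom_alt == 0:
        return None

    return {
        "alt_fraction": alt_reads / (ref_reads + alt_reads),
        "pl_ratio": pl_hom_ref / pl_hom_alt,
    }


fmt = "GT:AD:DP:GQ:PL"
site_1428318 = het_stats(fmt, "0/1:16,16:32:99:426,0,422")
site_1430207 = het_stats(fmt, "0/1:81,23:104:99:642,0,2872")

for name, stats in (("1428318", site_1428318), ("1430207", site_1430207)):
    print(name, stats)
```

Both numbers move away from their "balanced" values (0.5 and 1) at the second site. One thing to keep in mind is that PL values are Phred-scaled likelihoods that also grow with depth and base quality, so the alt read fraction from AD is probably the more direct quantity to plot, with the PL ratio as a secondary check.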