Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Any statistic to measure the "degree' of heterozygosity per site?

cjy8709cjy8709 CornellMember
edited June 2014 in Ask the GATK team

Hi folks
I had a somewhat strange question to ask people and get some opinion and see if this makes sense. I've followed the bestpractice guide and now at a stage where I'm looking at my SNP vcf file any interpret the data. I've sequenced a non-model Drosophila specifically I've pooled a couple of progeny from a single line and sequenced the genome. I think this pooling is biting me back since the het sites I'm getting back are not exactly following the traditional definiton of hets and homozygous variation (ie. homozygosity (0%/100%) or heterozygosity (50%))
For example the het site found in the line below follows the traditional definition of heterozygosity (50%ref and 50%alt):

genome 1428318 . G A 392.68 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=0.207;ClippingRankSum=-8.860e-01;DP=259;FS=3.192;MLEAC=1;MLEAF=0.071;MQ=60.00;MQ0=0;MQRankSum=0.697;QD=12.27;ReadPosRankSum=1.11 GT:AD:DP:GQ:PL 0/0:.:32:64:0,64,1395 0/0:.:43:99:0,120,1800 0/0:.:23:62:0,62,990 0/1:16,16:32:99:426,0,422 0/0:.:28:67:0,67,1080 0/0:.:56:99:0,99,1800 0/0:.:45:99:0,120,1800

But something like this site doesn't look like the traditional het site:

genome 1430207 . C T 608.68 PASS AC=1;AF=0.071;AN=14;BaseQRankSum=1.57;ClippingRankSum=-2.430e-01;DP=322;FS=1.962;MLEAC=1;MLEAF=0.071;MQ=60.00;MQ0=0;MQRankSum=0.172;QD=5.85;ReadPosRankSum=0.799 GT:AD:DP:GQ:PL 0/0:.:21:60:0,60,900 0/1:81,23:104:99:642,0,2872 0/0:.:40:62:0,62,1485 0/0:.:28:66:0,66,990 0/0:.:45:71:0,71,1620 0/0:.:24:64:0,64,990 0/0:.:60:92:0,92,1800

So from here is there any way to measure this difference? One thing I was thinking was for all het sites I was wondering if taking the ratio of the numbers reported for the PL field would be a statistic to look at the degree of heterozygosity per site. Specifically the ratio of PL for 0/0 and 1/1 genotype would be ~1 if there are 50% ref and 50% alt reads whereas a deviation from 1 would indicate otherwise. I was planning to plot this out and see if the ratios cluster around 1 or if it looks different. Would this be a wrong way of using the PL field and are there any other ways to see the so called "degree" of heterozygosit per site? I'm sorry if I sound too confusing.

Thanks!

Post edited by cjy8709 on
Tagged:

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cjy8709‌

    Hi,

    The best thing to do here is to use Unified Genotyper and input the ploidy. The input ploidy will be the number of individuals * their actual ploidy. This way you can see all possible variation that exists and their values.

    -Sheila

  • cjy8709cjy8709 CornellMember

    Hi Sheila

    Thanks for the reply. I've been thinking about changing the ploidy first but as you can see I have multiple samples (and each would then have different ploidy) thus I thought maybe taking the ratio of PL values under a diploid would be a sort of standaridzed way.

    In you opinion then since I have multiple samples with potentially different ploidy (a.k.a differnt number of individuals were prepared for each library) here would be it better to genotype my multiple samples under a range of ploidy and hopefully the GT I get in the end doesn't fluctuate to much?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cjy8709‌

    Hi,

    Do you know what number of individuals were in each library prep?

    -Sheila

  • cjy8709cjy8709 CornellMember

    @Sheila So some I do know but others unfortunately I don't.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cjy8709‌

    Ah, that makes it harder.

    Maybe instead of trying to use Unified Genotyper with polyploidy, you can use Haplotype Caller in GVCF mode. This will output all the observed alleles. Then you can decide how to deal with anything that's triallelic or more.

    -Sheila

  • cjy8709cjy8709 CornellMember

    @Sheila

    I'm alittle bit confused on what you were meaning using the HC for calling SNPs. Doesn't HC in the end call your variant as a 0/0, 0/1, 1/1?
    I should have said that all the heterozygous snps I have are biallelic and if all the heterozygous sites are called as 0/1 it goes back to my original question of how to differentiate a heterozygous site that has 50% ref 50% alt reads vs for example 20% ref 80 % alt reads.

    I'm sorry if I'm causing more confusion.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cjy8709‌

    Hi,

    When you output a GVCF file, it gives you all of the possible variants that are seen at the site. So, instead of only one variant allele, you can see all the variant alleles that are present. It also outputs the PL for each of those variant alleles. I think this will help you.

    You can read about GVCF here: http://www.broadinstitute.org/gatk/guide/article?id=4017

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    Also, when calling in normal mode, you force the caller to choose a biallelic configuration, suppressing any additional alleles that may be real. That's why GVCF mode is helpful: it shows you what would otherwise be hidden.

  • cjy8709cjy8709 CornellMember

    @Sheila

    Oh I see the new HC kind of calls like a polyploid during the GVCF output, but I guess later on in the joint analysis is when it turns into a diploid....
    Ok I think I have some idea on what to do now.
    Thanks again for the help!

Sign In or Register to comment.