To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Determining ploidy via read ratios with HaplotypeCaller: skewed results?

mmats010mmats010 Riverside CAMember

Hello,
I am attempting to determine the ploidy of my organism by observing where the modes of SNP allele ratios fall. For example, a 2N diploid should have a peak of allele read-counts at 0.5 since most SNPs should have, for example, 15 reference and 15 alternate reads. A triploid should have two modal peaks around 0.33 and 0.66, and a tetraploid should have four peaks around 0.25, 0.5, and 0.75.

I am using HaplotypeCaller to call SNPs on a segregating cross in gVCF+joint genotyping mode, then using SelectVariants to keep biallelic SNPs and using the JEXL expression " -select 'vc.getGenotype("sample").isHet()' " to isolate only those SNPs called as heterozygous, then taking the "-GF AD" option on SelectVariants and parsing the allele depths in the resulting table, and finally calculating the alternate allele read depth divided by the total read depth to get my final "ratio".

While this workflow seems to work and give a modal peak around 0.5, every sample previously known as a 2N diploid I try this on shows a significant shoulder on the left side of the density plot I generate (see attached). When I apply this to an artificial tetraploid (two 2N diploid individuals merged) I can see the expected three modal peaks around 0.25, 0.5, and 0.75, but again, there is a very significant bias toward the left.

My question is: Is this normal for HaplotypeCaller? Is there a particular reason why it is calling SNPs heterozygous only on one end of the spectrum (keeping SNPs with a low alt/total ratio, but not a high)? Or should I try a different SNP calling tool?

Answers

  • valentinvalentin Cambridge, MAMember, Dev
    edited October 2016

    Don't know why this would be happening, HC applies a prior that favors HOM_REF over HET or HOM_ALT so I would expect the opposite effect. In other words, this is showing that, whatever the reason is, there is an unexplained higher than expected number of sites with AB around 15%.

    What can you tell use about those sites? For example, are they scattered across the whole genome or do they cluster in particular regions? Are these located in relatively noisy regions (high density in number of variant sites, soft-clips, discordant read-pair mapping. etc)?

  • mmats010mmats010 Riverside CAMember

    Hi Valentin,
    After doing some rough characterization of the example I attached, it seems like SNPs with unusually low ratios (around 0.2) cluster in small regions. However, the read depths of these small regions does not usually deviate from the average SNP coverage, so I'm inclined to think that a local duplication favoring the reference allele didn't happen.

    However, thanks for answering my fundamental question of whether or not this is an artifact of HC. If this is a real result, then thats alright I suppose. I just wanted to make sure that HC won't throw out sites that might be heterozygous

Sign In or Register to comment.