Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

HaplotypeCaller Reference Confidence Model (GVCF mode)

This document describes the reference confidence model applied by HaplotypeCaller to generate a per-sample GVCF, invoked by -ERC GVCF or -ERC BP_RESOLUTION.

As explained here, HaplotypeCaller works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. At that point, we can calculate the likelihoods of each possible genotype and emit variant calls.

What that article does not explain is how HaplotypeCaller additionally estimates the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the REF base vs. all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models. We use a symbolic ALT allele, <NON_REF>, to hold the likelihood that the site is not homozygous reference, as well as allele-specific AD and PL field values.

We do this at all sites in the territory covered by the analysis, including homozygous-reference sites, both inside and outside the ActiveRegions determined by HaplotypeCaller.

Comments

Sign In or Register to comment.