We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
How the HaplotypeCaller's reference confidence model works

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFS), invoked by -ERC GVCF
or -ERC BP_RESOLUTION
(see the FAQ on gVCFs for format details).
Please note that this document may be expanded with more detailed information in the near future.
How it works
The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:
- Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
- Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.
Based on this, we emit the genotype likelihoods (PL
) and compute the GQ
(from the PL
s) for the least confidence of these two models.
We use a symbolic allele pair, <NON_REF>
, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD
and PL
field values.
For details of the gVCF format, please see the document that explains what is a gVCF.
Comments
Could you please help me understand why in the below example site "108206689" has it's PL values to be 0,0,0? I checked the depth in these sites and all of them have really nice depth of 564, as its previous block. I am running GATK 3.3.0. haplotypecaller.
chr11 108206591 . A . . END=108206591 GT:DP:GQ:MIN_DP:PL 0/0:169:0:169:0,0,4246
chr11 108206592 . A . . END=108206688 GT:DP:GQ:MIN_DP:PL 0/0:564:99:554:0,120,1800
chr11 108206689 . G . . END=108206698 GT:DP:GQ:MIN_DP:PL 0/0:564:0:555:0,0,0
chr11 108213915 . T . . END=108214010 GT:DP:GQ:MIN_DP:PL 0/0:782:99:716:0,120,1800
chr11 108214011 . T . . END=108214012 GT:DP:GQ:MIN_DP:PL 0/0:714:9:713:0,9,135
@krsna
Hi,
Can you please post the IGV screenshot for that block? Please post both the original bam file and the bamout file.
Thanks,
Sheila
Hi,
I have a similar problem.I am attaching the screenshots for my variant. The reads in the original bam are of good quality. I don't understand why that region is not covered at all in the bamout file.
This is the line I get in gVCF
@shiroann The region you're looking at is a tandem duplication in the sequence. What's probably happening is that when HC builds the assembly graph, all the reads are being assembled into the part of the region on the right, so the part on the left ends up being called ref but with zero confidence, because really we can't know what's going on there.
Thanks for getting back to me Geraldine. What puzzles me is that variant calling for this variant works well in on cohort and not the other. I am attaching one where it has worked. I am using v3.3-0-g37228af.
@shiroann
Hi,
Can you post the GVCF record for the sample that gets called?
Thanks,
Sheila
hi Sheila,
Is this the bit you want?
7 157441485 . C T, 3843.77 . BaseQRankSum=0.139;ClippingRankSum=1.212;DP=101;MLEAC=2,0;MLEAF=1.00,0.00;MQ=58.22;MQ0=0;MQRankSum=0.970;ReadPosRankSum=0.000 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:1,99,0:100:99:0|1:157441485_C_T:3872,244,0,3876,298,3930:1,0,48,51
@shiroann
Hi,
Sorry. I think you did post the GVCF record above. Can you zoom out and post IGV screenshots of the surrounding regions as well. I am curious why the first sample is getting mapped to the right, but the second sample is getting mapped to the same place.
Also, can you try running the first sample with -disableOptimizations, -forceActive, and -dontTrimActiveRegions. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--disableOptimizations
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--forceActive
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--dontTrimActiveRegions
Please post the IGV screenshots from that as well.
Thanks,
Sheila
Hi @Sheila ,
Here is the zoomed out Bamout images. The first Bam is for the called sample and the second bam is for the uncalled sample without the additional parameters you suggested. The third image is bamout with the additional parameters-which made the walker run alot slower.
The variant is still not called with the addtional parameters. The GVCF entry is
HI,
When I initially started using GATK I didn't follow videos as it was hard for me to understand and rather the written description helped, and I guess I have improved a lot (Thanks to GATK team). Now, I am checking videos and would like to give some feedback.
For the video on this link: http://www.broadinstitute.org/videos/broade-variant-calling-and-joint-genotyping
I find that at several places its really hard to understand what the author is talking about as there is no cursor pointing to the specific information on the screen. At some other places it seems that the information the author is talking about is not actually being displayed.
Several other video presentations also have similar problem (less or more).
I am entering the next step of learning GATK and I personally think (and probably it should be a common problem for people outside bioinformatics field) that providing the appropriate clues being talked about on the screen would be helpful. Also, improvement of screen resolution would be helpful. I understand that the internet speed outside USA could be a problem where GATK are being used but optimizing the sreeen resolution like in youtube could be helpful.
Also, is there a section where all the video presentation are available under one place?
Thank you in advance !!!
@everestial007
Hi,
Thanks for your feedback!
One of our team members noted the issue of the cursor not being seen in the videos. We have remedied that in the latest workshop by using an iPad that shows the cursor in the videos. Those videos will be uploaded soon.
As for talking about things not in the slides, I will see what the team thinks. I know for the HaplotypeCaller presentation you mentioned (I'm the presenter in that video), there are so many things to say about the tool that not all of it can be pictured in the slides.
Let me also see what the team can do about the screen resolution.
All the presentations from that particular workshop can be found here.
You can also find the slides from most of the workshops we do here.
-Sheila
Hmmm, its good to know that you are the one presenting in the video, I just realized that when I looked at your profile. BTW I am really waiting to check all the videos for furthering my GATK knowledge.
And thanks to you for helping !!!
Hi,
Could you also explain how the PQ score is calculated? Thanks in advance!
Hi, I think I still need a more detailed explanation on the algorithm to understand what exactly happened under HaplotypeCaller. For example, in the first stage of Identifying ActiveRegions, I would like to learn about how the probability is calculated?
In the video (
), you said something about the measure of entropy (which is the negative logarithm of the probability distribution of an event. Speaking of the probability distribution, how you generate that? Is the "probability distribution" referred to the probability distribution of two possible alleles in that site? Furthermore, is the probability of two alleles calculated based on the counts of that two alleles across all the reads spanning that site?).
On the other hand, in the webpage (https://software.broadinstitute.org/gatk/documentation/article?id=11077), you claimed the per-position score is actually a calculated probability of whether that site contains a variant(rather than the measure of entropy. Also, how you calculate this probability, could you please explain it or at least share some relevant links containing explainable content to my confusion) based on the reference-confidence model.
Thank you so much!! I really found that I'm having a very hard time trying to have a progress in learning the mechanism of HaplotypeCaller and subsequent VQSR tools. I'll be really grateful if you can show me some guidance. Thank you!
@Sheila , Sorry I forgot to use mention function.