The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

How the HaplotypeCaller's reference confidence model works

edited February 2015

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFS), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

• Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.

• Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models.

We use a symbolic allele pair, <NON_REF>, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what is a gVCF.

Post edited by dekling on

Geraldine Van der Auwera, PhD

Tagged:

Issue · Github May 2015 by Geraldine_VdAuwera

Issue Number
35
State
open
Last Updated
Assignee
Array

• FloridaMember Posts: 1

Could you please help me understand why in the below example site "108206689" has it's PL values to be 0,0,0? I checked the depth in these sites and all of them have really nice depth of 564, as its previous block. I am running GATK 3.3.0. haplotypecaller.

chr11 108206591 . A . . END=108206591 GT:DP:GQ:MIN_DP:PL 0/0:169:0:169:0,0,4246
chr11 108206592 . A . . END=108206688 GT:DP:GQ:MIN_DP:PL 0/0:564:99:554:0,120,1800
chr11 108206689 . G . . END=108206698 GT:DP:GQ:MIN_DP:PL 0/0:564:0:555:0,0,0
chr11 108213915 . T . . END=108214010 GT:DP:GQ:MIN_DP:PL 0/0:782:99:716:0,120,1800
chr11 108214011 . T . . END=108214012 GT:DP:GQ:MIN_DP:PL 0/0:714:9:713:0,9,135

@krsna
Hi,

Can you please post the IGV screenshot for that block? Please post both the original bam file and the bamout file.

Thanks,
Sheila

• Member Posts: 6
edited March 2015

Hi,
I have a similar problem.I am attaching the screenshots for my variant. The reads in the original bam are of good quality. I don't understand why that region is not covered at all in the bamout file.
This is the line I get in gVCF

7       157441485       .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:4,80:84:0:0,0,0

Post edited by Geraldine_VdAuwera on

@shiroann The region you're looking at is a tandem duplication in the sequence. What's probably happening is that when HC builds the assembly graph, all the reads are being assembled into the part of the region on the right, so the part on the left ends up being called ref but with zero confidence, because really we can't know what's going on there.

Geraldine Van der Auwera, PhD

• Member Posts: 6
edited March 2015

Thanks for getting back to me Geraldine. What puzzles me is that variant calling for this variant works well in on cohort and not the other. I am attaching one where it has worked. I am using v3.3-0-g37228af.

7       157441485       .       C       T,<NON_REF>     3843.77 .       BaseQRankSum=0.139;ClippingRankSum=1.212;DP=101;MLEAC=2,0;MLEAF=1.00,0.00;MQ=58.22;MQ0=0;MQRankSum=0.970;ReadPosRankSum=0.000   GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:1,99,0:100:99:0|1:157441485_C_T:3872,244,0,3876,298,3930:1,0,48,51


@shiroann
Hi,

Can you post the GVCF record for the sample that gets called?

Thanks,
Sheila

• Member Posts: 6

hi Sheila,

Is this the bit you want?

@shiroann
Hi,

Sorry. I think you did post the GVCF record above. Can you zoom out and post IGV screenshots of the surrounding regions as well. I am curious why the first sample is getting mapped to the right, but the second sample is getting mapped to the same place.

Also, can you try running the first sample with -disableOptimizations, -forceActive, and -dontTrimActiveRegions. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--disableOptimizations

Please post the IGV screenshots from that as well.

Thanks,
Sheila

• Member Posts: 6

Hi @Sheila ,

Here is the zoomed out Bamout images. The first Bam is for the called sample and the second bam is for the uncalled sample without the additional parameters you suggested. The third image is bamout with the additional parameters-which made the walker run alot slower.

The variant is still not called with the addtional parameters. The GVCF entry is

7 157441485 . C . . END=157441485 GT:DP:GQ:MIN_DP:PL 0/0:84:0:84:0,0,0

• GreensboroMember Posts: 65

HI,

When I initially started using GATK I didn't follow videos as it was hard for me to understand and rather the written description helped, and I guess I have improved a lot (Thanks to GATK team). Now, I am checking videos and would like to give some feedback.

I find that at several places its really hard to understand what the author is talking about as there is no cursor pointing to the specific information on the screen. At some other places it seems that the information the author is talking about is not actually being displayed.
Several other video presentations also have similar problem (less or more).

I am entering the next step of learning GATK and I personally think (and probably it should be a common problem for people outside bioinformatics field) that providing the appropriate clues being talked about on the screen would be helpful. Also, improvement of screen resolution would be helpful. I understand that the internet speed outside USA could be a problem where GATK are being used but optimizing the sreeen resolution like in youtube could be helpful.

Also, is there a section where all the video presentation are available under one place?

One of our team members noted the issue of the cursor not being seen in the videos. We have remedied that in the latest workshop by using an iPad that shows the cursor in the videos. Those videos will be uploaded soon.
As for talking about things not in the slides, I will see what the team thinks. I know for the HaplotypeCaller presentation you mentioned (I'm the presenter in that video), there are so many things to say about the tool that not all of it can be pictured in the slides.
Let me also see what the team can do about the screen resolution.
All the presentations from that particular workshop can be found here.
You can also find the slides from most of the workshops we do here.

-Sheila

• GreensboroMember Posts: 65

Hmmm, its good to know that you are the one presenting in the video, I just realized that when I looked at your profile. BTW I am really waiting to check all the videos for furthering my GATK knowledge.
And thanks to you for helping !!!