How the HaplotypeCaller's reference confidence model works

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin
edited February 20 in Methods and Algorithms

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFS), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models.

We use a symbolic allele pair, <NON_REF>, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what is a gVCF.

Post edited by dekling on

Geraldine Van der Auwera, PhD

Issue · Github
by Geraldine_VdAuwera

Issue Number
35
State
open
Last Updated

Comments

  • krsnakrsna FloridaPosts: 1Member

    Could you please help me understand why in the below example site "108206689" has it's PL values to be 0,0,0? I checked the depth in these sites and all of them have really nice depth of 564, as its previous block. I am running GATK 3.3.0. haplotypecaller.

    chr11 108206591 . A . . END=108206591 GT:DP:GQ:MIN_DP:PL 0/0:169:0:169:0,0,4246
    chr11 108206592 . A . . END=108206688 GT:DP:GQ:MIN_DP:PL 0/0:564:99:554:0,120,1800
    chr11 108206689 . G . . END=108206698 GT:DP:GQ:MIN_DP:PL 0/0:564:0:555:0,0,0
    chr11 108213915 . T . . END=108214010 GT:DP:GQ:MIN_DP:PL 0/0:782:99:716:0,120,1800
    chr11 108214011 . T . . END=108214012 GT:DP:GQ:MIN_DP:PL 0/0:714:9:713:0,9,135

  • SheilaSheila Broad InstitutePosts: 1,405Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @krsna
    Hi,

    Can you please post the IGV screenshot for that block? Please post both the original bam file and the bamout file.

    Thanks,
    Sheila

  • shiroannshiroann Posts: 6Member
    edited March 30

    Hi,
    I have a similar problem.I am attaching the screenshots for my variant. The reads in the original bam are of good quality. I don't understand why that region is not covered at all in the bamout file.
    This is the line I get in gVCF

    7       157441485       .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:4,80:84:0:0,0,0
    
    chr7_157,441,435_157,441,535.png
    1249 x 1200 - 25K
    BAM_OUT.png
    1889 x 1056 - 23K
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

    @shiroann The region you're looking at is a tandem duplication in the sequence. What's probably happening is that when HC builds the assembly graph, all the reads are being assembled into the part of the region on the right, so the part on the left ends up being called ref but with zero confidence, because really we can't know what's going on there.

    Geraldine Van der Auwera, PhD

  • shiroannshiroann Posts: 6Member
    edited March 30

    Thanks for getting back to me Geraldine. What puzzles me is that variant calling for this variant works well in on cohort and not the other. I am attaching one where it has worked. I am using v3.3-0-g37228af.

    7       157441485       .       C       T,<NON_REF>     3843.77 .       BaseQRankSum=0.139;ClippingRankSum=1.212;DP=101;MLEAC=2,0;MLEAF=1.00,0.00;MQ=58.22;MQ0=0;MQRankSum=0.970;ReadPosRankSum=0.000   GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:1,99,0:100:99:0|1:157441485_C_T:3872,244,0,3876,298,3930:1,0,48,51
    
    recal_bam.png
    1889 x 1056 - 29K
    bamout.png
    1889 x 1056 - 21K
    Post edited by Geraldine_VdAuwera on
  • SheilaSheila Broad InstitutePosts: 1,405Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @shiroann
    Hi,

    Can you post the GVCF record for the sample that gets called?

    Thanks,
    Sheila

  • shiroannshiroann Posts: 6Member

    hi Sheila,

    Is this the bit you want?

    7 157441485 . C T, 3843.77 . BaseQRankSum=0.139;ClippingRankSum=1.212;DP=101;MLEAC=2,0;MLEAF=1.00,0.00;MQ=58.22;MQ0=0;MQRankSum=0.970;ReadPosRankSum=0.000 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:1,99,0:100:99:0|1:157441485_C_T:3872,244,0,3876,298,3930:1,0,48,51

  • SheilaSheila Broad InstitutePosts: 1,405Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @shiroann
    Hi,

    Sorry. I think you did post the GVCF record above. Can you zoom out and post IGV screenshots of the surrounding regions as well. I am curious why the first sample is getting mapped to the right, but the second sample is getting mapped to the same place.

    Also, can you try running the first sample with -disableOptimizations, -forceActive, and -dontTrimActiveRegions. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--disableOptimizations
    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--forceActive
    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--dontTrimActiveRegions

    Please post the IGV screenshots from that as well.

    Thanks,
    Sheila

  • shiroannshiroann Posts: 6Member

    Hi @Sheila ,

    Here is the zoomed out Bamout images. The first Bam is for the called sample and the second bam is for the uncalled sample without the additional parameters you suggested. The third image is bamout with the additional parameters-which made the walker run alot slower.

    The variant is still not called with the addtional parameters. The GVCF entry is

    7 157441485 . C . . END=157441485 GT:DP:GQ:MIN_DP:PL 0/0:84:0:84:0,0,0

    called_zoomed_out.png
    1888 x 1056 - 37K
    zoomed_out.png
    1888 x 1056 - 32K
    new_bamout.png
    1889 x 1056 - 53K
Sign In or Register to comment.