Using the reference confidence calculation mode and generating a GVCF [RETIRED]

Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin

This document is out of date and has been replaced by the following: http://www.broadinstitute.org/gatk/guide/article?id=4017

As of GATK 2.7, the HaplotypeCaller is now capable of emitting a per-bp or summarized confidence estimate for a site being strictly homozygous-reference.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either a non-reference call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models.

We use a symbolic allele, <NON_REF>, to indicate that the site is homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

How to use it

This mode can be enabled with the command line argument --emitRefConfidence (-ERC), which has 3 options:

  • NONE: don't emit anything (default)
  • BP_RESOLUTION: emit detailed information for each BP
  • GVCF: emit a block summarized version of the BP_RESOLUTION data

BP_RESOLUTION mode

This mode emits a very detailed model for the reference confidence at each reference base within the requested calling intervals. The output looks like:

20      10000435        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:54,0:53:54:99:0,160,2385
20      10000436        .       G       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:53,0:53:53:99:0,159,2011
20      10000437        .       C       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:53,0:53:53:99:0,159,1971
20      10000438        .       C       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:53,0:53:53:99:0,159,1816
20      10000439        .       T       G       1553.77 .       AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75    GT:AD:DP:GQ:PL  1/1:0,55:55:99:1582,164,0
20      10000440        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:55,0:55:55:99:0,166,2063
20      10000441        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:55,0:55:55:99:0,165,2000
20      10000442        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,168,2095
20      10000443        .       A       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,169,2089
20      10000444        .       A       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,168,2093
20      10000445        .       C       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,169,2137

This shows each reference base in a 11 bp interval. Only one site (10000439) contains a SNP call. The other 10 bases are reference. For each reference base not part of a variant call we emit a VCF record with have the special symbolic allele <NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site. The genotype fields include the standard GATK GT, AD, DP, GQ, and PL fields, all calculated as ref vs. any non-reference base. There's a new special CD field as well, that gives the number of indel informative reads at this location, given the maximum size of the indel to consider.

Note that there's no site-level QUAL field value. We discussed this internally and since the QUAL is the probability that the site is polymorphic, all of the QUAL field values should be 0 here, so we decided to drop it.

GVCF mode

GVCF mode summarizes the output of BP_RESOLUTION by grouping sequential reference calls with similar GQ values into blocks, and emitting summarized blocks instead of the values at each base pair. The blocking is discrete, deterministic, and can be controlled by the command line parameter GVCFGQBands. The algorithm creates bands of GQ values and merges sequential blocks with GQ values within the same band into a single synthetic block.

For example, if the GVCFBQBands argument is -GQB 10 -GQB 20 then we'll create blocks in the GVCF where sequential <NON_REF> calls have GQ values either < 10, between 10 and 19, or are 20 or greater.

The GVCF INFO and FORMAT fields have a few new annotations. First, END and BLOCK_SIZE indicate the extent of the reference block and its size (redundant yes but easy to read). For each GVCF sample field there's now only GT, DP and GQ from the standard annotations. GT is always 0/0 while DP and GQ are representative values (currently median) from the bp resolution DP and GQ values within the block. The MIN_DP and MIN_GQ are the minimum values seen within the block for their respective fields.

The output for the above example in GVCF mode looks like:

20      10000435        .       T       <NON_REF>       .       .       BLOCK_SIZE=4;END=10000438       GT:DP:GQ:MIN_DP:MIN_GQ  0/0:53:99:53:159
20      10000439        .       T       G       1553.77 .       AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75    GT:AD:DP:GQ:PL  1/1:0,55:55:99:1582,164,0
20      10000440        .       T       <NON_REF>       .       .       BLOCK_SIZE=6;END=10000445       GT:DP:GQ:MIN_DP:MIN_GQ  0/0:56:99:55:165

Here's a GVCF output for 100 bp around the same variant:

20      10000400        .       T       <NON_REF>       .       .       BLOCK_SIZE=39;END=10000438      GT:DP:GQ:MIN_DP:MIN_GQ  0/0:57:99:53:128
20      10000439        .       T       G       1553.77 .       AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75    GT:AD:DP:GQ:PL  1/1:0,55:55:99:1582,164,0
20      10000440        .       T       <NON_REF>       .       .       BLOCK_SIZE=61;END=10000500      GT:DP:GQ:MIN_DP:MIN_GQ  0/0:54:99:49:136

Current caveats

  • Currently only works for single sample calling
  • Currently slows down the calculation significantly (but this will be improved in the future)
  • Only GVCF mode is likely to work with downstream GATK tools right now, as the special <NON_REF> allele may be treated as a polymorphic site. If you want to use the BP resolution mode you will likely need to split the VCF into the ref sites and polymorphic sites, perform standard filtering etc to the polymorphic VCF, and then merge the ref and non-ref VCFs back together

Feedback

We are interested in community feedback on the output format and additional information we should provide to make the reference model as useful as possible. Please post suggestions as comments here.

Post edited by Geraldine_VdAuwera on

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

Comments

  • dparkdpark Posts: 9Member

    Hi Mark, some random thoughts:

    For GVCF, if the goal is for one of those NON_REF lines to describe a block, shouldn't the REF allele be as long as the BLOCK_SIZE? If you're trying to find a way to represent this in VCF format, it would make the most sense to me to have the REF allele represent a stretch of reference sequence that is as long as the block being described. That would be most consistent with how most programs interpret VCF in general. Then, in fact, the BLOCK_SIZE and END fields seem unnecessary (they're already redundant between each other as it is).

    For the listed caveat about downstream GATK filtering tools: I can see why that might be an issue in general (until downstream tools are updated), but I don't quite see how it might be different for GVCF vs. BP resolution. What's the difference that makes it worse for BP resolution?

    Danny

    Danny Park, PhD -- Broad Institute, IDI, Sabeti Lab

  • rpoplinrpoplin Posts: 122GATK Developer mod
    edited August 2013

    Hey Danny. Thanks for the feedback!

    We chose to represent these reference blocks similar to how symbolic alleles from the 1000 Genomes Project are represented for ease of readability. We think that because of the symbolic alternate allele "<NON_REF>" the END field is required for the VCF spec. Perhaps we can add an option to output these blocks with the full reference representation if it helps.

    Just to be clear, the two options are to go from this:

    chr1 1 A <NON_REF> END=10

    to this:

    chr1 1 ACTGACTGAC <NON_REF> END=10

    For your second question- I think you are correct and there isn't anything that inherently makes the BP resolution representation worse. As a side note we are now actively working on updating downstream GATK tools to work with this reference representation as it has proved to be incredibly valuable!

    Happy to chat more,

    Post edited by rpoplin on
  • KatherineSKatherineS Posts: 9Member

    Is there any way to perform multi-sample genotyping using HaplotypeCaller in 2.7? As of this version, --output_mode has been replaced by --emitRefConfidence, but as noted above the latter only works for a single sample, whereas --output_mode with EMIT_ALL_SITES worked for multiple samples. I am running some analyses where I would like to use 2.7 consistently.

    Thanks very much,

    Katherine

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hi Katherine,

    You can still perform multisample calling with HC in the normal mode (the equivalent of the old --output_mode VARIANTS_ONLY), but there is currently no way to get HC to emit all sites in multisample calling.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi Geraldine,

    Thanks for the reply. Unfortunately I need to perform multisample genotyping (-genotyping_mode GENOTYPE_GIVEN_ALLELES) rather than variant calling (--genotyping_mode DISCOVERY). I will have to return to version 2.6 or use UnifiedGenotyper in 2.7 for the moment.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hi Katherine, you can still do multisample genotyping with GGA mode in HC 2.7; just pass the vcfs of alleles as both the alleles argument and as interval list (-L). That will produce genotype calls for all the sites you have alleles for, which are presumably the sites you are interested in?

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    I tried to post a reponse here on October 16th and received an 'awaiting moderator approval' message. My post still hasn't appeared. Has something gone wrong, or should I practise being patient? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hmm, that is odd --I'll have a look and let you know.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hi @KatherineS, I didn't find your post in the moderation queue, so it might have got lost -- sorry about that. Our spam filters sometimes get a little overzealous. Please feel free to repost.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    I tried to post again to say that I have reposted in another thread, and that message is also awaiting moderator approval - so the problem seems to be including a URL in the post (I linked to that other thread in both posts).

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Ah, that would make sense -- I'll look into it, but in the meantime I've verified your account so you should now be able to post links without getting hit by the spam filter.

    Geraldine Van der Auwera, PhD

  • lucdhlucdh Posts: 10Member
    edited December 2013

    In the "NON_REF" cases shown in you example of BP_RESOLUTION mode, the PL field always suggests a homozygous reference call. I have encountered some case where "NON_REF" co-occurs with a PL field that suggests heterozygosity. Is that expected? I would prefer to rely on the PL field but maybe I shouldn't in these cases? For instance, using HaplotypeCaller GATK 2.7-4 I get:

    13 32906480 . A NON_REF . . . GT:AD:CD:DP:GQ:PL 0/0:231,216:445:447:0:4780,0,5856

    A the same position, the UnifiedGenotyper GATK 2.7-4 says:

    13 32906480 rs766173 A C 2085.77 . ABHet=0.540;AC=1;AF=0.500;AN=2;BaseCounts=108,92,0,0;BaseQRankSum=-11.012;DB;DP=200;Dels=0.00;FS=1.109;HRun=0;HaplotypeScore=13.7700;LowMQ=0.0000,0.0000,200;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.864;PercentNBaseSolid=0.0000;QD=10.43;ReadPosRankSum=0.874;Samples=GC008004-C01-BRCA_Validation2;VariantType=SNP GT:AB:AD:APL:DP:GQ:MQ0:PL 0/1:0.540:108,92:2114,0,3091,2417,3332,5749,2417,3332,5749,5749:200:99:0:2114,0,3091

    Post edited by lucdh on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hi @lucdh,

    That behavior is not expected, no. Could you please upload some test files to our FTP so we can debug this locally? Detailed instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • volaviivolavii Posts: 17Member

    Hi,

    i assumed, that every position (base) of a chromosome is listed in the vcf-file created by BP_RESOLUTION. I tested this situation with a small chromosome (5002 bp). The vcf-file only contains 4949 bases. Is this due to a deletion? I counted all SNPs, INDELs and NON_REFs and got a value of 4949. Why are they missing?

    Thanks in advance, Volavii

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    If you have an indel that covers several positions, the program considers that those positions are accounted for and does not emit a record for them, as I recall. Have a look at the records surrounding indels and see if that's what's happening in your data.

    Geraldine Van der Auwera, PhD

  • volaviivolavii Posts: 17Member
    edited March 26

    Jepp, there is a very long indel (54bp):) But I also can see a SNP in this region (IGV), which is not detected. Is this due to the fact that it was not emitted in the first place, or that the INDEL was found inbevore and therefore covers this SNP?

    And something different The GATK search (upper right) is not working anymore , I tested it on a second PC, and it did not worked either....

    Post edited by volavii on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hi @volavii‌,

    It's possible that the SNP was disregarded as an artifact. If you believe this is in error, you can get HaplotypeCaller to emit the assembled haplotypes that it considered using the following argument:

    -bamOut haplotypes.bam -bamWriterType ALL_POSSIBLE_HAPLOTYPES

    It will tell you if it saw the SNP as a valid possibility or not.

    You are correct that the search function is currently broken; I'm hoping to get this fixed asap. In the meantime you can use Google search; most of the time it brings up the most relevant forum posts anyway if you include GATK forum in the search.

    Geraldine Van der Auwera, PhD

This discussion has been closed.