Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

What is a GVCF and how is it different from a 'regular' VCF?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin
edited April 11 in FAQs

Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

image

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.1 VCF. See also the ##GVCFBlock lines (after the ##FORMAT lines) which indicate the GQ ranges used for banding, and the definition of the MIN_DP annotation in the ##FORMAT lines.

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000117    .   C   T,<NON_REF> 612.77  .   BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235   GT:AD:DP:GQ:PL:SB   0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000211    .   C   T,<NON_REF> 638.77  .   BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549    GT:AD:DP:GQ:PL:SB   0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000694    .   G   A,<NON_REF> 961.77  .   BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB   0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001019    .   T   G,<NON_REF> 93.77   .   BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB   0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • lawremilawremi Posts: 1Member

    Hi, would you please clarify how the DP, GQ and PL fields are summarized for each run? Thanks!

  • GiusyGiusy Posts: 9Member

    Hello! I think I might have a problem in catching how the gVCF really works. WHat I mean is that I followed all the steps on hoe to call variants on cohorts of samples with HC so I obtained a final VCF containing the variants of something like 33 samples. But I noticed that the size is really small (around 122 Mb, which is about the size of a single gVCF that I obtained from HC). Then I performed the VSQR anf following annotation, but I think something went wrong since I end up with a file in which I can't distinguish the differend individuals anymore, moreover the number of variants does seem to me quite small in respect to what I expected. I think I made a really huge mistake somewhere, but I really can't figure out where! Could you help me? Thank you!!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    Hi @Giusy,

    If you end up with a file in which you can't distinguish the different individuals anymore, there is probably a problem with the read groups assigned to each sample. It's a pretty common mistake, if you assigned read group information that is not distinct for the various samples (especially the SM tag) then the program merges the information.

    Geraldine Van der Auwera, PhD

  • GiusyGiusy Posts: 9Member

    Dear Geraldine, thank you for your input. I definitely check but I'm quite sure this is not the problem. I added the read groups (including the SM)myself to my sample and I'm pretty sure theone I downloaded from 1000genomes have it already. Moreover ig i look up the output from GenotypeGVCFs in igv I can see all the different samples. Just that when i run the VSQR and annovar as I said I end up with a text file in which I can't distinguish them no longer and the variations are somenthing like 38000, which seems pretty low number to me considering the 34 samples. Is there an other step that I sould perform to extrapolate my samples before annotating it?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    @Giusy

    Ah, then something else is probably wrong, but I'll need more information from you in order to help you diagnose it. Can you please post all the commands you ran, and clarify which file is the one in which you can longer distinguish the samples? If you can post a few records from that file as well that would be very helpful.

    Geraldine Van der Auwera, PhD

  • GiusyGiusy Posts: 9Member

    Sure! These are all the commands I used so far. For simplicity I just omitted all the folder paths

    java -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I my_sample.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -R $REF/hs37d5.fa --dbsnp $REF/dbsnp_138.b37.vcf -L $REF/S04380110_Regionsb37.bed -o $SAMPLES/output.raw.snps.indels.g.vcf

    java -Xmx2g -jar $GATK/GenomeAnalysisTK.jar -R $REF/hs37d5.fa -T GenotypeGVCFs --max_alternate_alleles 10 --variant output1.raw.snps.indels.g.vcf --variant output2.raw.snps.indels.g.vcf ..... -o global_output.vcf

    java -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_output.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $REF/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 $REF/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 $REF/1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $REF/dbsnp_138.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R

    java -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF/hs37d5.fa -input global_output.vcf -mode SNP --ts_filter_level 99.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o global_recalibrated_snps_raw_indels.vcf

    java -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 $REF/Mills_and_1000G_gold_standard.indels.b37.vcf -an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R

    java -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -o global_recalibrated_variants.vcf

    perl $ANNOVAR/convert2annovar.pl -format vcf4 --includeinfo global_recalibrated_variants.vcf > snp.avinput

    perl $ANNOVAR/annotate_variation.pl -geneanno $SAPLES/snp.avinput -buildver hg19 humandb/

    perl $ANNOVAR/table_annovar.pl snp.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp138,ljb23_all -operation g,r,r,f,f,f,f -nastring .

    And this is a capture of the record. I actually noticed that something was wrong looking at it in Excel

    image

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    Ok, well it looks like you've done everything correctly for the GATK part; assuming the output of VQSR is ok, it looks like the problem is in the ANNOVAR output. I can't comment on Annovar's expected behavior since it's not our tool, so you should check with that tool's support at http://www.openbioinformatics.org/annovar/annovar_faq.html#bug

    Geraldine Van der Auwera, PhD

  • GiusyGiusy Posts: 9Member
    edited September 11

    Ok, well anyway it's good news that the GATK part is fine. I'll try to look better into annovar... Thank you very much for your quick help!

    Post edited by Giusy on
  • blueskypyblueskypy Posts: 213Member

    Hi, Geraldine, I wonder if you can help me with the following?

    In the following portion of a gVCF file,

        #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1
        1       1       .       N       <NON_REF>       .       .       END=10353       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
    

    Since the DP is 0, the GT should be No Call, i.e. ./.; why is it 0/0? Also since each line may represent a block, does it mean the stats are the mean of all sites in the block?

    A 2nd question, the doc of GenotypeGVCFs says it “produce correct genotype likelihoods, re-genotype the newly merged record”, does it mean it may change the GT of sites in a sample? For example, from A/T in the gVCF file to A/G after GenotypeGVCFs on a cohort containing the sample, or from No Call to a call?

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    Hi @blueskypy,

    Funny thing, Craig came to me with the same question this morning...

    Short answer is that you shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final. In this kind of record, this will end up being a no-call (since REF is N) once it goes through GenotypeGVCFs. The only reason there is a genotype call made here is a leftover bit of code that we are going to take out (hopefully in the next release). In fact, in the near future HC will emit no-call genotypes for every position in a GVCF, to remove this source of confusion. The behavior of regular (non-ERC) mode HC will not be affected.

    To your second question, yes, genotype calls can be changed by GenotypeGVCFs.

    Does that clarify things?

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 213Member

    Thanks so much, Geraldine! So to get the GT of all sites in a sample, I should use GenotypeGVCFs --includeNonVariantSites? if so, two more questions: 1. does it has an option similar to -ERC GVCF or BP_RESOLUTION for those non-variant sites? 2. Does it include both 0/0 and ./. sites? if not, does it mean that sites absent in the vcf output are all ./. sites? Is there a way to include ./. sites as well?

    Finally, not related but just wondering, does --standard_min_confidence_threshold_for_calling in GenotypeGVCFs refer to the GQ in GVCF file?

    Thanks a lot!

  • blueskypyblueskypy Posts: 213Member

    A followup question, my understanding is that VQSR only uses variant sites, so whether --includeNonVariantSites or not in GenotypeGVCFs has no effect on VQSR, is it right?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    @blueskypy

    Whoo that's a lot of questions. Yes, no, yes (I think). No, that arg refers to the variant confidence (QUAL). And yes, VQSR will ignore monomorphic hom-ref sites.

    Geraldine Van der Auwera, PhD

  • naymikmnaymikm Posts: 5Member

    Hi, I'm curious if I made a gvcf using 2 samples (1 normal, 1 cancer) from the same individual, could I use this resulting vcf file directly with snpEff?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,407Administrator, GATK Developer admin

    @naymikm‌, the GVCF is only an intermediate format. You must run GenotypeGVCFs on it before doing any further processing such as functional annotation with snpEFF.

    Geraldine Van der Auwera, PhD

  • naymikmnaymikm Posts: 5Member

    Sorry I worded that poorly. Would the resulting vcf file after running GenotypeGVCFs be usable with snpEFF directly?

  • SheilaSheila Broad InstitutePosts: 527Member, GATK Developer, Broadie, Moderator admin

    @naymikm

    Hello,

    Yes, your vcf will work with snpEFF.

    -Sheila

Sign In or Register to comment.