To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

GVCF - Genomic Variant Call Format

edited December 2017 in Dictionary

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variant discovery analyses.

Important notes

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller in GATK versions 3.x and 4.x contain additional information that is formatted in a very specific way. Read on to find out more.

GVCF files produced by HaplotypeCaller from GATK versions 3.x and 4.x are not substantially different. While we don't recommend mixing versions, and we have not tested this ourselves, it should be okay to use GVCFs made by different versions if the annotations and the GVCFBlock definitions (see below) are the same.


General comparison of VCF vs. GVCF

The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

image

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION GVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of GVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a GVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a GVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the GVCF header. The purpose of the blocks (also called banding) is to keep file size down, so we recommend using the -GVCF option over BP_RESOLUTION.


Example GVCF file

This is a banded GVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.2 VCF:

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

One FORMAT annotation is unique to the GVCF format:

##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">

This defines what was the minimum amount of coverage observed at any one site within a block of records.

The header goes on:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="[full command line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 4:04:34 PM EST">

At this point in the header we see the GVCFBlock definitions, which indicate the GQ ranges used for banding:

[individual blocks from 1 to 55]
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)

In recent versions of GATK, the banding strategy has been tuned to provide high resolution at lower values of GQ (59 and below) and more compression at high values (60 and above). Note that since GQ is capped at 99, records where the corresponding PL is greater than 99 are lumped into the 99-100 band.

After that, the header goes on:

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=GRCh37>
##source=HaplotypeCaller

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10001567 and ends at 20:10001616.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10001567    .   A   <NON_REF>   .   .   END=10001616    GT:DP:GQ:MIN_DP:PL  0/0:38:99:34:0,101,1114
20  10001617    .   C   A,<NON_REF> 493.77  .   BaseQRankSum=1.632;ClippingRankSum=0.000;DP=38;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=136800.00;ReadPosRankSum=0.170    GT:AD:DP:GQ:PL:SB   0/1:19,19,0:38:99:522,0,480,578,538,1116:11,8,13,6
20  10001618    .   T   <NON_REF>   .   .   END=10001627    GT:DP:GQ:MIN_DP:PL  0/0:39:99:37:0,105,1575
20  10001628    .   G   A,<NON_REF> 1223.77 .   DP=37;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=133200.00   GT:AD:DP:GQ:PL:SB   1/1:0,37,0:37:99:1252,111,0,1252,111,1252:0,0,21,16
20  10001629    .   G   <NON_REF>   .   .   END=10001660    GT:DP:GQ:MIN_DP:PL  0/0:43:99:38:0,102,1219
20  10001661    .   T   C,<NON_REF> 1779.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1808,129,0,1808,129,1808:0,0,26,16
20  10001662    .   T   <NON_REF>   .   .   END=10001669    GT:DP:GQ:MIN_DP:PL  0/0:44:99:43:0,117,1755
20  10001670    .   T   G,<NON_REF> 1773.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1802,129,0,1802,129,1802:0,0,25,17
20  10001671    .   G   <NON_REF>   .   .   END=10001673    GT:DP:GQ:MIN_DP:PL  0/0:43:99:42:0,120,1800
20  10001674    .   A   <NON_REF>   .   .   END=10001674    GT:DP:GQ:MIN_DP:PL  0/0:42:96:42:0,96,1197
20  10001675    .   A   <NON_REF>   .   .   END=10001695    GT:DP:GQ:MIN_DP:PL  0/0:41:99:39:0,105,1575
20  10001696    .   A   <NON_REF>   .   .   END=10001696    GT:DP:GQ:MIN_DP:PL  0/0:38:97:38:0,97,1220

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Post edited by Geraldine_VdAuwera on
Sign In or Register to comment.