

# What is a GVCF and how is it different from a 'regular' VCF?

edited October 2014 in FAQs

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variant analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta


Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines), repeated here for clarity:

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0


Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
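A quick way to see how much territory each block covers is to parse the END tag out of the INFO field. The helper below is an illustrative sketch (not a GATK tool), applied to the first record above:

```python
# Illustrative sketch (not a GATK tool): extract the span of a gVCF
# non-variant block record from its POS and the END tag in INFO.

def block_span(vcf_line):
    """Return (chrom, start, end) for a gVCF record; END defaults to POS."""
    fields = vcf_line.split("\t")
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    end = pos  # single-site records may omit END
    for entry in info.split(";"):
        if entry.startswith("END="):
            end = int(entry[len("END="):])
    return chrom, pos, end

record = ("20\t10000000\t.\tT\t<NON_REF>\t.\t.\tEND=10000116\t"
          "GT:DP:GQ:MIN_DP:PL\t0/0:44:99:38:0,89,1385")
chrom, start, end = block_span(record)
print(chrom, start, end, end - start + 1)  # 20 10000000 10000116 117
```

So the first block record stands in for 117 hom-ref sites in a single line, which is where the file-size savings of banding come from.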


Geraldine Van der Auwera, PhD


• Posts: 1Member

Hi, would you please clarify how the DP, GQ and PL fields are summarized for each run? Thanks!

• Posts: 9Member

Hello! I think I might have a problem understanding how the gVCF really works. What I mean is that I followed all the steps on how to call variants on cohorts of samples with HC, so I obtained a final VCF containing the variants of something like 33 samples. But I noticed that the size is really small (around 122 Mb, which is about the size of a single gVCF that I obtained from HC). Then I performed the VQSR and the following annotation, but I think something went wrong, since I ended up with a file in which I can't distinguish the different individuals anymore; moreover, the number of variants seems quite small compared to what I expected. I think I made a really huge mistake somewhere, but I really can't figure out where! Could you help me?
Thank you!!

Hi @Giusy,

If you end up with a file in which you can't distinguish the different individuals anymore, there is probably a problem with the read groups assigned to each sample. It's a pretty common mistake, if you assigned read group information that is not distinct for the various samples (especially the SM tag) then the program merges the information.

Geraldine Van der Auwera, PhD

• Posts: 9Member

Dear Geraldine, thank you for your input. I'll definitely check, but I'm quite sure this is not the problem. I added the read groups (including the SM tag) myself to my sample, and I'm pretty sure the ones I downloaded from 1000 Genomes have them already. Moreover, if I look at the output from GenotypeGVCFs in IGV I can see all the different samples. It's just that when I run the VQSR and ANNOVAR, as I said, I end up with a text file in which I can no longer distinguish them, and the variants number something like 38,000, which seems a pretty low number to me considering the 34 samples. Is there another step that I should perform to extract my samples before annotating?

Ah, then something else is probably wrong, but I'll need more information from you in order to help you diagnose it. Can you please post all the commands you ran, and clarify which file is the one in which you can no longer distinguish the samples? If you can post a few records from that file as well, that would be very helpful.

Geraldine Van der Auwera, PhD

• Posts: 9Member

Sure! These are all the commands I used so far. For simplicity I just omitted all the folder paths

java -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I my_sample.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -R $REF/hs37d5.fa --dbsnp $REF/dbsnp_138.b37.vcf -L $REF/S04380110_Regionsb37.bed -o $SAMPLES/output.raw.snps.indels.g.vcf

java -Xmx2g -jar $GATK/GenomeAnalysisTK.jar -R $REF/hs37d5.fa -T GenotypeGVCFs --max_alternate_alleles 10 --variant output1.raw.snps.indels.g.vcf --variant output2.raw.snps.indels.g.vcf ..... -o global_output.vcf

java -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_output.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $REF/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 $REF/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 $REF/1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $REF/dbsnp_138.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R

java -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF/hs37d5.fa -input global_output.vcf -mode SNP --ts_filter_level 99.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o global_recalibrated_snps_raw_indels.vcf

java -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 $REF/Mills_and_1000G_gold_standard.indels.b37.vcf -an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R

java -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -o global_recalibrated_variants.vcf

perl $ANNOVAR/convert2annovar.pl -format vcf4 --includeinfo global_recalibrated_variants.vcf > snp.avinput

perl $ANNOVAR/annotate_variation.pl -geneanno $SAPLES/snp.avinput -buildver hg19 humandb/

perl $ANNOVAR/table_annovar.pl snp.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp138,ljb23_all -operation g,r,r,f,f,f,f -nastring .

And this is a capture of the record. I actually noticed that something was wrong looking at it in Excel

Ok, well it looks like you've done everything correctly for the GATK part; assuming the output of VQSR is ok, the problem appears to be in the ANNOVAR output. I can't comment on ANNOVAR's expected behavior since it's not our tool, so you should check with that tool's support at http://www.openbioinformatics.org/annovar/annovar_faq.html#bug

Geraldine Van der Auwera, PhD

• Posts: 9Member
edited September 2014

Ok, well anyway it's good news that the GATK part is fine. I'll try to look deeper into ANNOVAR... Thank you very much for your quick help!

• Posts: 253Member ✭✭

Hi, Geraldine,
I wonder if you can help me with the following?

In the following portion of a gVCF file,

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1
1       1       .       N       <NON_REF>       .       .       END=10353       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0


Since the DP is 0, the GT should be No Call, i.e. ./.; why is it 0/0?
Also since each line may represent a block, does it mean the stats are the mean of all sites in the block?

A 2nd question, the doc of GenotypeGVCFs says it “produce correct genotype likelihoods, re-genotype the newly merged record”, does it mean it may change the GT of sites in a sample? For example, from A/T in the gVCF file to A/G after GenotypeGVCFs on a cohort containing the sample, or from No Call to a call?

Thanks,

Hi @blueskypy,

Funny thing, Craig came to me with the same question this morning...

Short answer is that you shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final. In this kind of record, this will end up being a no-call (since REF is N) once it goes through GenotypeGVCFs. The only reason there is a genotype call made here is a leftover bit of code that we are going to take out (hopefully in the next release). In fact, in the near future HC will emit no-call genotypes for every position in a GVCF, to remove this source of confusion. The behavior of regular (non-ERC) mode HC will not be affected.

To your second question, yes, genotype calls can be changed by GenotypeGVCFs.

Does that clarify things?

Geraldine Van der Auwera, PhD

• Posts: 253Member ✭✭

Thanks so much, Geraldine! So to get the GT of all sites in a sample, I should use GenotypeGVCFs --includeNonVariantSites? if so, two more questions:
1. does it have an option similar to -ERC GVCF or BP_RESOLUTION for those non-variant sites?
2. Does it include both 0/0 and ./. sites? if not, does it mean that sites absent in the vcf output are all ./. sites? Is there a way to include ./. sites as well?

Finally, not related but just wondering, does --standard_min_confidence_threshold_for_calling in GenotypeGVCFs refer to the GQ in GVCF file?

Thanks a lot!

• Posts: 253Member ✭✭

A followup question, my understanding is that VQSR only uses variant sites, so whether --includeNonVariantSites or not in GenotypeGVCFs has no effect on VQSR, is it right?

Whoo that's a lot of questions. Yes, no, yes (I think). No, that arg refers to the variant confidence (QUAL). And yes, VQSR will ignore monomorphic hom-ref sites.

Geraldine Van der Auwera, PhD

• Posts: 6Member

Hi, I'm curious if I made a gvcf using 2 samples (1 normal, 1 cancer) from the same individual, could I use this resulting vcf file directly with snpEff?

@naymikm‌, the GVCF is only an intermediate format. You must run GenotypeGVCFs on it before doing any further processing such as functional annotation with snpEFF.

Geraldine Van der Auwera, PhD

• Posts: 6Member

Sorry I worded that poorly. Would the resulting vcf file after running GenotypeGVCFs be usable with snpEFF directly?

Hello,

Yes, your vcf will work with snpEFF.

-Sheila

• Posts: 12Member

Please could you explain how GQ is calculated in the GVCF format for a given block. I noticed that in cases where the MIN_DP value is less than the DP value, GQ is always significantly higher than the PL for a 0/1 call. Why would that be the case?

20 10000759 . A . . END=10001018 GT:DP:GQ:MIN_DP:PL 0/0:40:99:28:0,65,1080

Why is GQ not 65? Note that DP=40 and MIN_DP=28?

Hi @pwhiteusa,

I'm not sure what you mean. The PL values are normalized so that the PL of the called genotype is 0 (see this FAQ for details). The PL 65 is for the het possibility, but the record you mention is hom-ref.
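To make the normalization concrete, here is a hedged sketch with invented likelihood values (not from any real sample): raw genotype likelihoods are Phred-scaled and shifted so the most likely genotype has PL 0; GQ is then the second-smallest PL, conventionally capped at 99.

```python
import math

# Sketch of PL normalization and its relationship to GQ (see the linked
# FAQ). The likelihood values below are invented for illustration.

def to_normalized_pls(likelihoods):
    """Phred-scale raw genotype likelihoods and shift so min(PL) == 0."""
    raw = [-10 * math.log10(l) for l in likelihoods]
    best = min(raw)
    return [round(p - best) for p in raw]

def gq_from_pls(pls):
    """GQ is the second-smallest normalized PL, capped at 99."""
    return min(sorted(pls)[1], 99)

pls = to_normalized_pls([0.9, 1e-7, 1e-140])  # hom-ref overwhelmingly likely
print(pls, gq_from_pls(pls))  # [0, 70, 1400] 70
```

In the record quoted above (PL 0,65,1080), the 0 belongs to the called 0/0 genotype and 65 is the Phred-scaled cost of the het alternative, not a quality for the call that was made.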

Geraldine Van der Auwera, PhD

• Posts: 12Member
edited November 2014

Here are four examples where GQ is high:

20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315


And four examples where it is not:

20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315


What I have observed is that in GVCF format, a block of reference calls is typically given a GQ equal to the 0/1 PL, but only if DP = MIN_DP.

If MIN_DP < DP, then the GQ value is always higher than the 0/1 PL. You see the effect more clearly with a lower-coverage sample, but essentially the greater the delta between MIN_DP and DP, the greater the GQ.

Q1. Why is that the case?

Q2. How is GQ calculated in the GVCF format for a given block?


@pwhiteusa I don't see the trends you are seeing. The first 4 records, which you say have high GQ, have GQs of 99, 99, 12, 42. The other 4 have GQs of 0, 72, 37, 21. So two of the first four are really in the wrong category. And for the second category, the DP=MIN_DP observation does not apply for the first element. I think you may be seeing coincidental similarities.

Generally speaking, it's not surprising that you might see larger GQs when the MIN_DP and DP are farther apart, because that suggests that there are some samples with even more coverage than those already respectable depth values. With that amount of depth, you can get a good amount of confidence in the call.

IIRC, the GQ of a block is the lowest GQ of the sites that are all within the same GQ range (as defined in the vcf header).

Geraldine Van der Auwera, PhD

• Posts: 12Member
edited December 2014

@Geraldine_VdAuwera here's what I am observing in the examples.

When DP = Min_DP, the GQ value is always the second-highest PL value (or very close to it). Here it is for the examples above in the second set of four:

GQ = 0 and PL = 0,0,1121

GQ = 72 and PL = 0,72,1080

GQ = 37 and PL = 0,37,909

GQ = 21 and PL = 0,21,315

Now, let's look at the first set, where DP > Min_DP. This is no longer the case:

GQ = 99 and PL = 0,95,1226 Note: DP - Min_DP = 4

GQ = 99 and PL = 0,120,1800 Note: DP - Min_DP = 3

GQ = 12 and PL = 0,9,135 Note: DP - Min_DP = 3

GQ = 42 and PL = 0,21,315 Note: DP - Min_DP = 2

Do you see the trend? When Min_DP is less than DP, the GQ value is always higher than the second highest PL value - why would that be the case?

For example, in the 4th example, where DP and Min_DP are both 49, the GQ (21) is the same as the 0/1 PL (21). Now look at the last example (the 8th) - GQ (42) is double the 0/1 PL (21) and DP is actually less (39-41) in this case.

So I guess it ultimately comes down to the answer you gave to question 2 above:

How is GQ calculated in the GVCF format for a given block?: "the GQ of a block is the lowest GQ of the sites that are all within the same GQ range". If that were the case, in situations where DP > Min_DP would you not expect the GQ value to go down, not up?

• Posts: 12Member
edited December 2014

Sorry @Geraldine_VdAuwera‌ could you also explain your comment: "because that suggests that there are some samples with even more coverage than those already respectable depth values" - does MinDP refer to the depth from other samples - or did you mean bases?


@‌pwhiteusa Thanks for clarifying -- I think I understand better now what you mean, but the "trend" is just the difference between single-site blocks (your first set) and multi-site blocks (your second set), which you can tell from the END tag in your earlier post.

For the first set, you can read the PL and GQ as if it were a regular (not GVCF) record: the GQ is equal to the second-likeliest PL due to how the normalization is done; that is expected. In the other set, there are other sites with GQs that may be different, although they should still be in the same GQ band (as defined above), so you can see a slight mismatch between the GQ and the second-likeliest PL. I'll check with the developers whether it makes sense that the rule for integrating the values from multiple sites is not exactly the same for different fields, since that seems to be causing the confusion here.

To your other question, I mistyped my answer -- I did mean sites, not samples, sorry. Note that when you have a multi-site block, DP should always be either equal to or greater than MIN_DP (since by definition MIN_DP is the lowest DP observed in the group, while DP is an average). Finally, I don't understand what you mean by "expect the GQ value to go down, not up". Up or down from what? If you mean relative to the single-site records, keep in mind that those are usually on their own because there is something there that is unlike the surrounding sites -- some of them better, some of them worse. Given that is all that differentiates the two sets, I don't think you can infer much from comparing just a handful of sites.
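The DP vs. MIN_DP relationship described above can be shown with a toy example (numbers invented, not taken from a real gVCF): within a multi-site block, DP summarizes the typical depth while MIN_DP is a floor.

```python
# Toy illustration of the DP / MIN_DP relationship in a multi-site block:
# DP is (roughly) the average depth across the block's sites, while MIN_DP
# is the lowest depth observed, so MIN_DP <= DP always. Invented numbers.

depths = [44, 41, 38, 45, 43]  # per-site depths within one block
dp = round(sum(depths) / len(depths))
min_dp = min(depths)
print(dp, min_dp)  # 42 38
```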

Geraldine Van der Auwera, PhD

• Posts: 12Member
edited December 2014

@Geraldine_VdAuwera‌ I totally didn't notice the single site vs. multi-site, thank you. For a given block in the same GQ band, I would expect that the reported GQ would be equal to or less than the second-likeliest PL. If the GQ that is being reported is the lowest GQ for that GQ band, should the PL not also be the lowest PL in the band? I guess the question for the developers would be "How is PL calculated for a given band?" and "Why would a GQ be higher than the second-likeliest PL".


@pwhiteusa Yep I need to check how the band PL is determined -- we might have an inconsistency here.

Geraldine Van der Auwera, PhD

@pwhiteusa After discussing this with the developer, we found that there is indeed an inconsistency -- when summarizing sites in a reference block, the program takes the median GQ, but the mean of the PLs, which is why you are seeing different values. I've put in a request to have this made more consistent in the next version.
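The mismatch can be reproduced with a toy example (site values invented for illustration): taking the median of the per-site GQs but the mean of the per-site PLs leaves the block GQ unequal to the second-smallest block PL.

```python
import statistics

# Toy reproduction of the inconsistency described above: a block GQ taken
# as the median of per-site GQs need not equal the second-smallest block
# PL when the PLs are averaged. Site values invented for illustration.

site_gqs = [21, 42, 60]      # per-site genotype qualities
site_het_pls = [21, 45, 63]  # second entry of each site's PL triple

block_gq = statistics.median(site_gqs)        # 42
block_het_pl = statistics.mean(site_het_pls)  # 43
print(block_gq, block_het_pl)  # block GQ (42) no longer matches the het PL (43)
```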

Geraldine Van der Auwera, PhD

• Posts: 22Member

Hi,

I have followed the Best Practices (first HaplotypeCaller with the -GVCF option, then GenotypeGVCFs), using version 3.2, for my study. But I got a record like this:
chr1 802320 rs12565310 G A 6702.90 . BaseQRankSum=1.95;DB;DP=805;FS=3.582;MLEAC=2;MLEAF=1.00;MQ=43.69;MQ0=0;MQRankSum=-1.700e-02;QD=30.47;ReadPosRankSum=1.01;SOR=3.363 GT:AD:DP:GQ:PL ./.:298,0:298 1/1:2,218:.:99:6729,618,0 ./.:279,0:279

As far as I can tell, both the first and third individuals got fairly good sequencing depth at this position (298 and 279, respectively), but the genotype was ./. . I thought a genotype would only be represented like this if there were no reads covering the site.

I also check the gVCF file for each sample on this site, though @Geraldine_VdAuwera has suggested that we shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Anyway here are the records from the gvcf:

chr1 802320 . G . . END=802320 GT:DP:GQ:MIN_DP:PL 0/0:298:0:298:0,0,0

chr1 802320 . G . . END=802320 GT:DP:GQ:MIN_DP:PL 0/0:279:0:279:0,0,0

Can anyone give me a hint on this? Thanks in advance.

Hi @chenyu600 ,

You're right that no-calls (./.) are usually reserved for sites with no read coverage. But there's clearly something odd going on at these sites -- you see those two samples have PLs of "0,0,0" which means that the program has no clue what is going on there. Have you looked at the site in a genome browser like IGV? That may shed some light on what's happening here.

Geraldine Van der Auwera, PhD

• Posts: 22Member

Thank you @Geraldine_VdAuwera !

I would check it out and come back to you as soon as possible!

• Posts: 22Member

I have checked the BAM file for each sample in IGV. I just looked into the BAM files produced after BWA and Picard MarkDuplicates. Since the first (II5A) and third (III3A) individuals were not called at this position, I think there is no need to look into the BAM generated by --bamOutput.

Here is the depth for each sample (from IGV):
II5A
Total count: 840
A : 577 (69%, 291+, 286- )
C : 10 (1%, 1+, 9- )
G : 252 (30%, 111+, 141- )
T : 1 (0%, 0+, 1- )

N : 0

II7A
Total count: 654
A : 460 (70%, 240+, 220- )
C : 7 (1%, 0+, 7- )
G : 186 (28%, 70+, 116- )
T : 1 (0%, 0+, 1- )

N : 0

III3A
Total count: 666
A : 500 (75%, 273+, 227- )
C : 4 (1%, 0+, 4- )
G : 156 (23%, 67+, 89- )
T : 6 (1%, 0+, 6- )

N : 0

It seems that each sample got pretty good depth for this site (chr1:802320).

I also attach the bam files. These files are just alignment from chr1:801320-803320.

The files were not attached -- this forum can only handle plain text and pictures. We prefer that you post a screenshot of your data viewed in IGV.

Please also check the mapping qualities of your reads and also the base qualities. Lots of depth is useless if it's not good quality.

Geraldine Van der Auwera, PhD

• Posts: 22Member
edited March 26

Sorry for the delayed response!
Here is one of the samples that was not called at chr1:802320 even though reads cover it.
Statistics from IGV is below:
Total count: 840
A : 577 (69%, 291+, 286- )
C : 10 (1%, 1+, 9- )
G : 252 (30%, 111+, 141- )
T : 1 (0%, 0+, 1- )
N : 0

while information grep from result generated by HaplotypeCaller with the -GVCF option is:
chr1 802320 . G . . END=802320 GT:DP:GQ:MIN_DP:PL 0/0:298:0:298:0,0,0

AFAIK, the "DP" value (298) in this record is based on the filtered read count, which is much smaller than the count from IGV (840).

https://dropbox.com/sh/y3xxmaqygxg9s9g/AACGB848dEUkruJNm2TfA8Qla?dl=0


@chenyu600 It looks like this region has a big tandem duplication in the sequence, so it's very likely that after reassembly, the mapping looks different. If you can share a snippet of data we can try to figure out exactly what's happening here. Instructions are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

Geraldine Van der Auwera, PhD

• Posts: 22Member

I have followed the instructions and uploaded my dataset. The archive file is chenyu600_chr1_802320.tar.gz.
There are three samples: II5A, II7A and III3A. All the BAMs are snippets of the original chromosome 1 BAM files. To be exact, they are BAM files (II5A_snippet.bam, for example) with 500 bp of padding on either side of chr1:802320. Using these files reproduces almost the same result, except that the annotations of the record are a bit different.

The original result is :
chr1 802320 rs12565310 G A 6702.90 . BaseQRankSum=1.95;DB;DP=805;FS=3.582;MLEAC=2;MLEAF=1.00;MQ=43.69;MQ0=0;MQRankSum=-1.700e-02;QD=30.47;ReadPosRankSum=1.01;SOR=3.363 GT:AD:DP:GQ:PL ./.:298,0:298 1/1:2,218:.:99:6729,618,0 ./.:279,0:279

And the new result is :
chr1 802320 rs12565310 G A 6811.90 . BaseQRankSum=1.70;ClippingRankSum=1.20;DB;DP=807;FS=6.140;MLEAC=2;MLEAF=1.00;MQ=44.03;MQ0=0;MQRankSum=0.811;QD=30.68;ReadPosRankSum=-9.830e-01;SOR=34.852 GT:AD:DP:GQ:PL ./.:297,0:297 1/1:1,221:222:99:6838,654,0 ./.:281,0:281

In both results, II5A and III3A were not genotyped even though the site is covered by almost 300 reads. But the annotation values are somewhat different. Besides, I wonder why the DP values in the new results are higher than those in the original results (shouldn't they be equal to or less than the original ones, since there is less data?). And in the original result, the DP of the second individual is ".", which I think means that all the reads covering this site were filtered out. However, in the snippet dataset, 222 reads are supposed to pass the filter. I am kind of lost. Would you give me a hint on these too?

There is one more thing that confuses me. When running on the original chromosome 1 BAM files, I set -Xmx 5G for each step and that ran pretty well. But when I ran the snippets, errors kept popping up saying "There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java", and I had to set -Xmx up to 9G to make HaplotypeCaller run. I thought memory use would go up when feeding the tools big data, but this time more memory was required for less data (you can see the exact command lines in run.sh; run.sh.e451547 is the log file). Would you mind explaining that to me?

Thank you!

#### Issue · Github March 30 by Geraldine_VdAuwera

Issue Number: 906
State: closed
Closed By: chandrans

@chenyu600, thanks for the snippet. I'm not sure what's going on with the depth; we'll start by checking out the data you sent and let you know.

I can't explain the higher memory requirement you encountered running on the smaller dataset, sorry. Would be interesting to know if this is reproducible or if it was perhaps a transient issue on your platform.

Geraldine Van der Auwera, PhD


@Geraldine_VdAuwera, hope you can work it out soon
Thank you!

@chenyu600
Hi,

-Sheila


Hi @Geraldine_VdAuwera. On this thread we have two answers as to how the GQ in a band of a gVCF is produced:

November 2014: IIRC, the GQ of a block is the lowest GQ of the sites that are all within the same GQ range (as defined in the vcf header).
December 2014: After discussing this with the developer, we found that there is indeed an inconsistency -- when summarizing sites in a reference block, the program takes the median GQ, but the mean of the PLs, which is why you are seeing different values. I've put in a request to have this made more consistent in the next version.

I assume the second is correct, but does that mean that the gVCF is being represented wrongly - i.e. the median GQ is being used to determine bands rather than the more accurate mean PL? 1) How will these be made "more consistent" - will the GQ become equivalent to the second-lowest value in the PL, as for variant positions?

This is important because I notice that when going from 100 gVCFs (-GQB 20) to a post-VQSR 100-sample VCF, the GQ in the latter always gets set to its PL equivalent, often resulting in a huge change in GQ. For example, in the following, a GQ of 99 in the homozygous REF bands gets reduced to just 21 and 23 when the blocks are treated as calls:

In the gVCF:
19      39062815   .   G   C,<NON_REF>     3131.77 PASS    BaseQRankSum=0.328;ClippingRankSum=1.256;DP=243;
19      39062522   .   A   <NON_REF>   0.0 .   END=39062984    GT:DP:GQ:MIN_DP:PL     0/0:83:99:7:0,21,242
19      39062568   .   G   <NON_REF>   0.0 .   END=39062940    GT:DP:GQ:MIN_DP:PL      0/0:69:99:8:0,23,270


In the post-VQSR VCF:

    19      39062815        .       G       C       3117.43 VQSRTrancheSNP90.00to99.00      AC=1;AF=4.545e-03;AN=220;
BaseQRankSum=0.328;ClippingRankSum=1.26;DP=1331;FS=3.097;InbreedingCoeff=-0.0077;LikelihoodRankSum=1.53;
0/1:126,116:242:99:3160,0,3374  0/0:7,0:7:21:0,21,242   0/0:8,0:8:23:0,23,270
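The drop from GQ 99 to 21 and 23 is consistent with how GQ relates to the genotype likelihoods: the normalized PL of the called genotype is 0, and GQ is the gap to the next-best genotype's PL, capped at 99. A minimal sketch of that arithmetic (not GATK code), using the PL triplets from the records above:

```python
def gq_from_pls(pls, cap=99):
    """GQ is the difference between the two smallest normalized PLs, capped at 99."""
    s = sorted(pls)
    return min(cap, s[1] - s[0])

# PL triplets from the bands above:
print(gq_from_pls([0, 21, 242]))  # 21 -- the GQ seen in the post-VQSR VCF
print(gq_from_pls([0, 23, 270]))  # 23
```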


2) Is this expected behaviour?

The point is, as shown by the DP in the original, and having looked at the BAMs, that this position is very well supported - this is a real de novo in this trio - at least it Sangered positive. But if I saw a read depth of 8 and a GQ of 23 in the homozygous REF calls, I would usually ignore it - also notice that it doesn't get a "PASS". Further, the information in the gVCF is more reliable here, but it is lost by the time VQSR has been performed.

I guess that adding more banding to my gVCFs might help resolve this - certainly the DP would be expected to improve (since it is the MIN_DP that is passed to the post-VQSR VCF). 3) Is the DP reported in the gVCF the mean, or the median, do you know?
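To make the quoted block-summarization behaviour concrete, here is a minimal sketch of collapsing a run of non-variant sites into one block record, assuming (per the December quote earlier in the thread) the median GQ, the mean of the PLs, and the minimum DP as MIN_DP. This is an illustration under those assumptions, not the actual HaplotypeCaller logic, and the per-site values are invented except the last site, which echoes the record above:

```python
from statistics import median

def summarize_block(sites):
    """Summarize per-site (DP, GQ, PL-triplet) records for one reference block.

    Assumption from the thread: median GQ, mean of PLs, min DP as MIN_DP.
    """
    dps = [s["DP"] for s in sites]
    gqs = [s["GQ"] for s in sites]
    pls = [s["PL"] for s in sites]
    mean_pl = [round(sum(vals) / len(vals)) for vals in zip(*pls)]
    return {"MIN_DP": min(dps), "GQ": median(gqs), "PL": mean_pl}

sites = [
    {"DP": 83, "GQ": 99, "PL": (0, 99, 990)},   # invented site
    {"DP": 40, "GQ": 45, "PL": (0, 45, 450)},   # invented site
    {"DP": 7,  "GQ": 21, "PL": (0, 21, 242)},   # the weak site from the record above
]
print(summarize_block(sites))
```

Under these assumptions the block would report GQ 45 (the median) even though the weakest site in it has GQ 21, which is exactly the kind of discrepancy discussed above.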

Hi @Sheila - 4) can you explain how the issue was "fixed" and if this impacts on my questions above?

Sorry to ask so many questions, but at least they are clearly numbered. :-)

Thanks as always for the great support you are providing, Steve


Hi @Sheila, thank you very much for the work! Following your instructions, I downloaded GATK version 3.3 and tested it on my dataset. It worked pretty well this time and I think the problem has been fixed. Thanks again!

But I am now a little confused by the coverage depth for this position. The depth (viewed in IGV) at this position in the original BAM files (obtained from BWA + de-dup by Picard + recalibration by GATK) for each sample was:

sample 1(II5A):
Total count: 840
A : 577 (69%, 291+, 286- )
C : 10 (1%, 1+, 9- )
G : 252 (30%, 111+, 141- )
T : 1 (0%, 0+, 1- )

N : 0

sample 2(II7A):
Total count: 654
A : 460 (70%, 240+, 220- )
C : 7 (1%, 0+, 7- )
G : 186 (28%, 70+, 116- )
T : 1 (0%, 0+, 1- )

N : 0

sample 3(III3A):
Total count: 666
A : 500 (75%, 273+, 227- )
C : 4 (1%, 0+, 4- )
G : 156 (23%, 67+, 89- )
T : 6 (1%, 0+, 6- )

N : 0

But the depth (either DP or AD) in the final VCF file was much lower than in the original. Here is the record grepped from the final VCF produced by GATK version 3.3:

chr1 802320 rs12565310 G A 22078.90 . BaseQRankSum=2.38;ClippingRankSum=0.195;DB;DP=778;FS=6.999;MLEAC=6;MLEAF=1.00;MQ=42.26;MQ0=0;MQRankSum=0.466;QD=29.01;ReadPosRankSum=1.84;SOR=1.925 GT:AD:DP:GQ:PL 1/1:6,271:277:99:7657,690,0 1/1:2,218:220:99:6744,638,0 1/1:6,258:264:99:7704,655,0
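One quick consistency check on this record is that each sample's AD values sum to its informative DP. A small parsing sketch (the sample columns are copied verbatim from the record, and the FORMAT layout GT:AD:DP:GQ:PL comes from the record itself):

```python
# Sample columns copied from the VCF record above (FORMAT is GT:AD:DP:GQ:PL).
samples = ["1/1:6,271:277:99:7657,690,0",
           "1/1:2,218:220:99:6744,638,0",
           "1/1:6,258:264:99:7704,655,0"]

for col in samples:
    gt, ad, dp, gq, pl = col.split(":")
    ad_sum = sum(int(x) for x in ad.split(","))
    # The informative depth DP equals the sum of the allele depths here.
    print(gt, ad_sum, dp, ad_sum == int(dp))
```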

You see sample 1 with a depth of 277, sample 2 with 220 and sample 3 with 264. I know that HaplotypeCaller will downsample the data to 250x, and that some reads with bad quality will be discarded. I also had HaplotypeCaller output the BAM files to which the assembled haplotypes were written. The depth (viewed in IGV) at this position this time was:

sample 1 (II5A)
Total count: 280
A : 272 (97%, 193+, 79- )
C : 1 (0%, 0+, 1- )
G : 7 (3%, 3+, 4- )
T : 0

N : 0

sample 2 (II7A)
Total count: 224
A : 219 (98%, 162+, 57- )
C : 1 (0%, 0+, 1- )
G : 3 (1%, 0+, 3- )
T : 1 (0%, 0+, 1- )

N : 0

sample 3 (III3A)
Total count: 268
A : 259 (97%, 179+, 80- )
C : 0
G : 7 (3%, 3+, 4- )
T : 2 (1%, 0+, 2- )

N : 0

I have two questions:

1. Will the downsampling step affect the result? I mean, is there ever a chance that most reads supporting the reference base are discarded (for example, in sample 1, discarding 240 of the 252 reads supporting G and 250 of the 577 reads supporting A), so that the genotype turns out to be 1/1, since G is the reference base?
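For intuition on question 1: downsampling picks reads roughly uniformly, so allele fractions are preserved in expectation, and losing 240 of 252 G reads while keeping most A reads would be an extreme statistical outlier rather than something downsampling itself does. A small sketch computing the expected retained counts, using the IGV counts for sample 1 quoted above (A: 577, G: 252, total 840, downsampling target 250):

```python
def expected_retained(allele_count, total_depth, target_depth):
    """Expected reads kept for one allele under uniform downsampling."""
    return target_depth * allele_count / total_depth

total = 840  # total count for sample 1 from the IGV view above
print(round(expected_retained(577, total, 250), 1))  # A reads expected: 171.7
print(round(expected_retained(252, total, 250), 1))  # G reads expected: 75.0
```

So after uniform downsampling one would still expect roughly 75 G-supporting reads, not the handful actually observed, which points to read filtering rather than downsampling.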

Thank you!


Hi again @Geraldine_VdAuwera - I see I may have misunderstood what you were saying about the band above (November) - of course the boundaries of the band are determined by the lowest GQ, by definition. I have now run HC specifying 5 or 10 GQ bands, and that has helped substantially with the final DP and GQ values, at least in this instance, while only increasing HC run-time by ~10%.

I am still curious about my other questions.

Thanks, Steve

@SteveL

1)-2) You're doing the right thing -- we've actually increased the granularity of the default GQ bands in the upcoming version (3.4, to be released soon-ish).

3) I think it's the mean DP but I could be wrong. Will have to check.

4) I believe Sheila was talking to the other user who has been active in this thread -- that's why for more complex questions that are likely to take some back and forth, it's usually better to create a new discussion than to post a comment in a document thread.

Geraldine Van der Auwera, PhD

@chenyu600 The downsampling can have marginal effects on results, but nothing dramatic. If you're seeing 240 out of 252 reads being discarded, that's not due to downsampling. It's more likely that it's due to quality issues in your reads and they're being filtered out.

I don't know why the AD count is off by one; can you show a screenshot of this?

Geraldine Van der Auwera, PhD


@Geraldine_VdAuwera said:
SteveL

1)-2) You're doing the right thing -- we've actually increased the granularity of the default GQ bands in the upcoming version (3.4, to be released soon-ish).

3) I think it's the mean DP but I could be wrong. Will have to check.

4) I believe Sheila was talking to the other user who has been active in this thread -- that's why for more complex questions that are likely to take some back and forth, it's usually better to create a new discussion than to post a comment in a document thread.

Thanks Geraldine. I realised Sheila was talking to @chenyu600 , but I wasn't sure exactly how the issue was fixed, and whether it would mean I should switch to downloading a nightly version, rather than using the original 3.3?

Sorry I didn't start a new thread ;-( - thought my original queries were relevant to the tool, but in future I'll post a new thread.

Oh ok, I guess I was the one confused then

That's right, anytime we say something is fixed, we mean in the latest nightly... unless we recently released a new version (which is going to happen sooooon).

You don't absolutely have to create a new thread every time, but a good rule of thumb is: if you find yourself writing more than one paragraph, it should probably be a new question. Document threads are meant more for clarification of specific points in the doc than for reporting problems. But sometimes things are not obviously one or the other, so don't worry too much about it. We'll still answer you either way.

Geraldine Van der Auwera, PhD

@SteveL Actually, in this case, version 3.3 has the fix. @chenyu600 was using version 3.2. Sorry for the confusion. I guess I need to be more specific about that in the future.

@chenyu600 I will have a look at your files by Monday and get back to you with a proper response.

-Sheila


Hi @Geraldine_VdAuwera ,
I am very sorry for the late response, but I wonder which screenshot you would like to see? Do you mean a screenshot of the IGV result? If so, the screenshot is below. Hope this helps.

@Sheila, thank you for your work, and I hope the screenshot helps. If you want any other dataset, just mention it ;)

https://www.dropbox.com/s/vm93stuhxtvfqio/igv_depth.png?dl=0