The current GATK version is 3.3-0

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

What is a GVCF and how is it different from a 'regular' VCF?

edited October 2014 in FAQs

Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta


Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0


Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 1Member

Hi, would you please clarify how the DP, GQ and PL fields are summarized for each run? Thanks!

• Posts: 9Member

Hello! I think I might have a problem in catching how the gVCF really works. WHat I mean is that I followed all the steps on hoe to call variants on cohorts of samples with HC so I obtained a final VCF containing the variants of something like 33 samples. But I noticed that the size is really small (around 122 Mb, which is about the size of a single gVCF that I obtained from HC). Then I performed the VSQR anf following annotation, but I think something went wrong since I end up with a file in which I can't distinguish the differend individuals anymore, moreover the number of variants does seem to me quite small in respect to what I expected. I think I made a really huge mistake somewhere, but I really can't figure out where! Could you help me?
Thank you!!

Hi @Giusy,

If you end up with a file in which you can't distinguish the different individuals anymore, there is probably a problem with the read groups assigned to each sample. It's a pretty common mistake, if you assigned read group information that is not distinct for the various samples (especially the SM tag) then the program merges the information.

Geraldine Van der Auwera, PhD

• Posts: 9Member

Dear Geraldine, thank you for your input. I definitely check but I'm quite sure this is not the problem. I added the read groups (including the SM)myself to my sample and I'm pretty sure theone I downloaded from 1000genomes have it already. Moreover ig i look up the output from GenotypeGVCFs in igv I can see all the different samples. Just that when i run the VSQR and annovar as I said I end up with a text file in which I can't distinguish them no longer and the variations are somenthing like 38000, which seems pretty low number to me considering the 34 samples. Is there an other step that I sould perform to extrapolate my samples before annotating it?

Ah, then something else is probably wrong, but I'll need more information from you in order to help you diagnose it. Can you please post all the commands you ran, and clarify which file is the one in which you can longer distinguish the samples? If you can post a few records from that file as well that would be very helpful.

Geraldine Van der Auwera, PhD

• Posts: 9Member

Sure! These are all the commands I used so far. For simplicity I just omitted all the folder paths

java -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I my_sample.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -R$REF/hs37d5.fa --dbsnp $REF/dbsnp_138.b37.vcf -L$REF/S04380110_Regionsb37.bed -o $SAMPLES/output.raw.snps.indels.g.vcf java -Xmx2g -jar$GATK/GenomeAnalysisTK.jar -R $REF/hs37d5.fa -T GenotypeGVCFs --max_alternate_alleles 10 --variant output1.raw.snps.indels.g.vcf --variant output2.raw.snps.indels.g.vcf ..... -o global_output.vcf java -jar$GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_output.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0$REF/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 $REF/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0$REF/1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $REF/dbsnp_138.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R java -jar$GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF/hs37d5.fa -input global_output.vcf -mode SNP --ts_filter_level 99.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o global_recalibrated_snps_raw_indels.vcf java -jar$GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0$REF/Mills_and_1000G_gold_standard.indels.b37.vcf -an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R

java -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration -R$REF/hs37d5.fa -input global_recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -o global_recalibrated_variants.vcf

perl $ANNOVAR/convert2annovar.pl -format vcf4 --includeinfo global_recalibrated_variants.vcf > snp.avinput perl$ANNOVAR/annotate_variation.pl -geneanno $SAPLES/snp.avinput -buildver hg19 humandb/ perl$ANNOVAR/table_annovar.pl snp.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp138,ljb23_all -operation g,r,r,f,f,f,f -nastring .

And this is a capture of the record. I actually noticed that something was wrong looking at it in Excel

Ok, well it looks like you've done everything correctly for the GATK part; assuming the output of VQSR is ok, it looks like the problem is in the ANNOVAR output. I can't comment on Annovar's expected behavior since it's not our tool, so you should check with that tool's support at http://www.openbioinformatics.org/annovar/annovar_faq.html#bug

Geraldine Van der Auwera, PhD

• Posts: 9Member
edited September 2014

Ok, well anyway it's good news that the GATK part is fine. I'll try to look better into annovar... Thank you very much for your quick help!

Post edited by Giusy on
• Posts: 228Member

Hi, Geraldine,
I wonder if you can help me with the following?

In the following portion of a gVCF file,

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1
1       1       .       N       <NON_REF>       .       .       END=10353       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0


Since the DP is 0, the GT should be No Call, i.e. ./.; why is it 0/0?
Also since each line may represent a block, does it mean the stats are the mean of all sites in the block?

A 2nd question, the doc of GenotypeGVCFs says it “produce correct genotype likelihoods, re-genotype the newly merged record”, does it mean it may change the GT of sites in a sample? For example, from A/T in the gVCF file to A/G after GenotypeGVCFs on a cohort containing the sample, or from No Call to a call?

Thanks,

Hi @blueskypy,

Funny thing, Craig came to me with the same question this morning...

Short answer is that you shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final. In this kind of record, this will end up being a no-call (since REF is N) once it goes through GenotypeGVCFs. The only reason there is a genotype call made here is a leftover bit of code that we are going to take out (hopefully in the next release). In fact, in the near future HC will emit no-call genotypes for every position in a GVCF, to remove this source of confusion. The behavior of regular (non-ERC) mode HC will not be affected.

To your second question, yes, genotype calls can be changed by GenotypeGVCFs.

Does that clarify things?

Geraldine Van der Auwera, PhD

• Posts: 228Member

Thanks so much, Geraldine! So to get the GT of all sites in a sample, I should use GenotypeGVCFs --includeNonVariantSites? if so, two more questions:
1. does it has an option similar to -ERC GVCF or BP_RESOLUTION for those non-variant sites?
2. Does it include both 0/0 and ./. sites? if not, does it mean that sites absent in the vcf output are all ./. sites? Is there a way to include ./. sites as well?

Finally, not related but just wondering, does --standard_min_confidence_threshold_for_calling in GenotypeGVCFs refer to the GQ in GVCF file?

Thanks a lot!

• Posts: 228Member

A followup question, my understanding is that VQSR only uses variant sites, so whether --includeNonVariantSites or not in GenotypeGVCFs has no effect on VQSR, is it right?

Whoo that's a lot of questions. Yes, no, yes (I think). No, that arg refers to the variant confidence (QUAL). And yes, VQSR will ignore monomorphic hom-ref sites.

Geraldine Van der Auwera, PhD

• Posts: 6Member

Hi, I'm curious if I made a gvcf using 2 samples (1 normal, 1 cancer) from the same individual, could I use this resulting vcf file directly with snpEff?

@naymikm‌, the GVCF is only an intermediate format. You must run GenotypeGVCFs on it before doing any further processing such as functional annotation with snpEFF.

Geraldine Van der Auwera, PhD

• Posts: 6Member

Sorry I worded that poorly. Would the resulting vcf file after running GenotypeGVCFs be usable with snpEFF directly?

Hello,

Yes, your vcf will work with snpEFF.

-Sheila

• Posts: 12Member

Please could you explain how GQ is calculated in the GVCF format for a given block. I noticed that in cases where the MIN_DP value is less than the DP value, GQ is always significantly higher than the PL for a 0/1 call. Why would that be the case?

20 10000759 . A . . END=10001018 GT:DP:GQ:MIN_DP:PL 0/0:40:99:28:0,65,1080

Why is GQ not 65? Note that DP=40 and MIN_DP=28?

Hi @pwhiteusa,

I'm not sure what you mean. The PL values are normalized so that the PL of the called genotype is 0 (see this FAQ for details). The PL 65 is for the het possibility, but the record you mention is hom-ref.

Geraldine Van der Auwera, PhD

• Posts: 12Member
edited November 2014

Here are four examples where GQ is high:

20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315


And four examples where it is not:

20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315


What I have observed is that in GVCF format, a block of reference calls typically is given a GQ of the 0/1 PL, but only if DP=MIN_DP.

If Min_DP < DP, then the GQ value is always higher than the 0/1 PL. You see the effect more clearly if there is a lower coverage sample, but essentially the higher the delta between the MIN_DP and DP value then the greater the GQ.

Q1. Why is that the case?

Q2. How is GQ calculated in the GVCF format for a given block?

Post edited by Geraldine_VdAuwera on

@pwhiteusa I don't see the trends you are seeing. The first 4 records, which you say have high GQ, have GQs of 99, 99, 12, 42. The other 4 have GQs of 45, 26, 25, 49. So two of the first four are really in the wrong category. And for the second category, the DP=MIN_DP observation does not apply for the first element. I think you may be seeing coincidental similarities.

Generally speaking, it's not surprising that you might see larger GQs when the MIN_DP and DP are farther apart, because that suggests that there are some samples with even more coverage than those already respectable depth values. With that amount of depth, you can get a good amount of confidence in the call.

IIRC, the GQ of a block is the lowest GQ of the sites that are all within the same GQ range (as defined in the vcf header).

Geraldine Van der Auwera, PhD

• Posts: 12Member
edited December 2014

@Geraldine_VdAuwera here's what I am observing in the examples.

When DP = Min_DP the GQ value is always the second highest PL value (or very close to it). This for the examples above in the second set of four:

GQ = 0 and PL = 0,0,1121

GQ = 72 and PL = 0,72,1080

GQ = 37 and PL = 0,37,909

GQ = 21 and PL = 0,21,315

Now, lets look at the first block where DP > Min_DP. This is no longer the case:

GQ = 99 and PL = 0,95,1226 Note: DP - Min_DP = 4

GQ = 99 and PL = 0,120,1800 Note: DP - Min_DP = 3

GQ = 12 and PL = 0,9,135 Note: DP - Min_DP = 3

GQ = 42 and PL = 0,21,315 Note: DP - Min_DP = 2

Do you see the trend? When Min_DP is less than DP, the GQ value is always higher than the second highest PL value - why would that be the case?

For example, in the 4th example, where DP and Min_DP are both 49, the GQ (21) is the same as the 0/1 PL (21). Now look at the last example (the 8th) - GQ (42) is double the 0/1 PL (21) and DP is actually less (39-41) in this case.

So I guess it ultimately comes down to the answer you gave to question 2 above:

How is GQ calculated in the GVCF format for a given block?: "the GQ of a block is the lowest GQ of the sites that are all within the same GQ range". If that were the case, in situations where DP > Min_DP would you not expect the GQ value to go down, not up?

Post edited by pwhiteusa on
• Posts: 12Member
edited December 2014

Sorry @Geraldine_VdAuwera‌ could you also explain your comment: "because that suggests that there are some samples with even more coverage than those already respectable depth values" - does MinDP refer to the depth from other samples - or did you mean bases?

Post edited by pwhiteusa on

@‌pwhiteusa Thanks for clarifying -- I think I understand better now what you mean, but the "trend" is just the difference between single-site blocks (your first set) and multi-site blocks (your second set), which you can tell from the END tag in your earlier post.

For the first set, you can read the PL and GQ as if it was a regular (not GVCF) record: the GQ is equal to the second-likeliest PL cue to how the normalizing is done; that is expected. In the other set, there are other sites with GQs that may be different, although they should still be in the same GQ band (as defined above), so that's why you can see a slight mismatch between the GQ and the second-likeliest PL. I'll check with the developers whether it makes sense that the rule for integrating the values from multiple sites is not exactly the same for different fields, since that seems to be causing the confusion here.

To your other question, I mistyped my answer -- I did mean sites, not samples, sorry. Note that when you have a multi-site block, DP should always be either equal to or greater than MIN_DP (since by definition MIN_DP is the lowest DP observed in the group, while DP is an average). Finally, I don't understand what you mean by "expect the GQ value to go down, not up". Up or down from what? If you mean relative to the single-site records, keep in mind that those are usually on their own because there is something there that is unlike the surrounding sites -- some of them better, some of them worse. Given that is all that differentiates the two sets, I don't think you can infer much from comparing just a handful of sites.

Geraldine Van der Auwera, PhD

• Posts: 12Member
edited December 2014

@Geraldine_VdAuwera‌ I totally didn't notice the single site vs. multi-site, thank you. For a given block in the same GQ band, I would expect that the reported GQ would be equal to or less than the second-likeliest PL. If the GQ that is being reported is the lowest GQ for that GQ band, should the PL not also be the lowest PL in the band? I guess the question for the developers would be "How is PL calculated for a given band?" and "Why would a GQ be higher than the second-likeliest PL".

Post edited by pwhiteusa on