
Difference between vcf directly generated by HC and vcf generated from GenotypeGVCFs


I have three general questions about using HaplotypeCaller (I know I could have tested these myself, but I figured it would be more reliable to get answers from the people who develop the tool):

  1. For single sample analysis, is the vcf generated directly from HC the same as the vcf generated using GenotypeGVCFs on the gvcf generated from HC?
  2. For multi-sample analysis, in terms of speed, how is the performance of running GenotypeGVCFs on each gvcf, compared with combining all gvcfs to run joint-calling, assuming we can get all gvcfs in parallel (say for 500 samples)?
  3. It seems the gvcf can be generated in two modes, -ERC GVCF or -ERC BP_RESOLUTION. How does the one generated using -ERC BP_RESOLUTION differ from a vcf with all variant calls, reference calls and missing calls? And in terms of file size, say for NA12878 whole genome, how much does the gvcf from -ERC GVCF differ from the one from -ERC BP_RESOLUTION?

Thank you very much for your attention; any information from you will be highly appreciated.


  • Sheila (Broad Institute; Member, Broadie)


    1) Yes. The two VCFs will be the same.

    2) GenotypeGVCFs is very fast and is meant to be run on many GVCFs together. However, we do recommend running CombineGVCFs for hierarchical merging of GVCFs if you have a very large sample size.
    If that does not answer your question, perhaps this document will.

    3) The GVCF is meant to be an intermediate file that will be input to GenotypeGVCFs. The two types of GVCFs contain information that is not included in an all sites VCF. The information is used in joint-genotyping. I think these two articles will help you: https://www.broadinstitute.org/gatk/guide/article?id=4017
    As for file size, it really depends on the quality of your reads. The GVCF groups reference blocks based on the genotype quality scores. You can always run on a small section of your file in both GVCF mode and BP_RESOLUTION mode and extrapolate that number to how big your interval size is. This thread may help too.
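The "run on a small section and extrapolate" suggestion above can be sketched as simple proportional scaling. This is a rough sketch only: the function name and the example numbers are hypothetical, and it assumes that output size grows linearly with interval length, which may not hold in GVCF mode because reference block sizes depend on how the genotype qualities vary across the region.

```python
def extrapolate_size(test_bytes, test_interval_bp, target_interval_bp):
    """Scale the output size measured on a test interval to a larger interval.

    Assumes size is proportional to the number of bases covered (an
    approximation; GVCF reference blocks vary with data quality).
    """
    if test_interval_bp <= 0:
        raise ValueError("test_interval_bp must be positive")
    return test_bytes * target_interval_bp / test_interval_bp


# Hypothetical numbers: a 1 Mb test region producing 50 MB of output,
# scaled to a 3 Gb genome.
estimate = extrapolate_size(50_000_000, 1_000_000, 3_000_000_000)
print(f"estimated size: {estimate / 1e9:.1f} GB")
```

Running the same test interval once in each mode and comparing the two estimates gives a rough ratio between -ERC GVCF and -ERC BP_RESOLUTION output for your own data.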


  • jianl (Member)

    Hi @Sheila,

    Thanks a lot for your replies.

    I ran a test myself regarding my first question. On a 24X human genome, the resulting vcfs (the regular vcf from HC, and the vcf from genotyping the gvcf from HC) are different. The vcf from the gvcf had more unique calls (>1000) than the regular vcf (>500). Could you explain a little more about what GenotypeGVCFs does with one sample?

    Regarding question 2, I'm also wondering: if I have 1000-2000 samples, do I need to run CombineGVCFs first before GenotypeGVCFs? Will the result from GenotypeGVCFs on the combined gvcf be the same as the result from running GenotypeGVCFs with all gvcfs as inputs directly?

    In addition, how long would you estimate it takes to run CombineGVCFs on 1000-2000 samples on a typical desktop computer (2 GHz CPU, 8 GB memory, etc.)? How does the run time scale as the number of samples increases (I know it should be non-linear, but do you have more information)? And is there any speed difference if I run GenotypeGVCFs on a combined gvcf vs. on all gvcfs as direct inputs (if I can)?

    Thanks a lot for your information!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi there, did you check that the emit and call confidence thresholds were the same for HC and GGVCFs? The defaults may not be the same. That could have a big effect on what is output.

    Yes, we recommend running CombineGVCFs hierarchically so that GenotypeGVCFs doesn't have to open more than 200 files at most.

    We don't currently provide resource utilization guidelines, sorry. But yes, GGVCFs will be faster and consume less memory if run on fewer files.
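As a rough illustration of the hierarchical merging described above, the per-sample gvcfs can be split into batches of at most 200 files, with each batch merged by CombineGVCFs and the resulting combined gvcfs passed to GenotypeGVCFs. The sketch below only plans the batches; the function name and the sample file names are hypothetical, and the actual tool invocations are deliberately left out.

```python
def plan_batches(gvcf_paths, max_per_batch=200):
    """Split a list of per-sample gvcf paths into batches of at most
    max_per_batch files, preserving the input order."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    return [
        gvcf_paths[start:start + max_per_batch]
        for start in range(0, len(gvcf_paths), max_per_batch)
    ]


# Hypothetical file names for 2000 samples.
samples = [f"sample_{i:04d}.g.vcf" for i in range(2000)]
batches = plan_batches(samples)
print(len(batches), "batches; largest has", max(len(b) for b in batches), "files")
```

With 2000 samples this yields 10 batches of 200, so the final genotyping step would read 10 combined files rather than 2000 individual ones.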

  • jianl (Member)

    Hi @Geraldine_VdAuwera,

    Thanks a lot for your reply!
    Yes, I set "--standard_min_confidence_threshold_for_emitting" to the same value (0.1) for both HC and GGVCFs. I didn't explicitly specify the call confidence, so both tools should be using their default value. Are the default call confidences the same for the two tools?


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Yes, both are set at Q30.

    When you look at the calls made by GGVCFs but not by HC, are they all in a very low range of call confidence? Having a few hundred very low-conf calls be different over a whole genome is not unexpected.
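One way to check this is to list the calls unique to each VCF together with their QUAL values. The sketch below is a minimal, hypothetical helper (not part of GATK) that assumes plain tab-delimited VCF text and treats two records as the same call when CHROM, POS, REF and ALT all match.

```python
def parse_calls(lines):
    """Map (chrom, pos, ref, alt) -> QUAL for each record in VCF text lines.

    Header lines (starting with '#') are skipped; a missing QUAL ('.')
    is stored as None.
    """
    calls = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        calls[(chrom, int(pos), ref, alt)] = None if qual == "." else float(qual)
    return calls


def unique_calls(a, b):
    """Return the calls present in a but absent from b, with their QUAL."""
    return {key: a[key] for key in a.keys() - b.keys()}


# Tiny in-memory example; real input would come from open("file.vcf").
hc_vcf = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL", "1\t100\t.\tA\tG\t50.0"]
gg_vcf = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL", "1\t100\t.\tA\tG\t50.0",
          "1\t300\t.\tG\tA\t5.2"]
only_gg = unique_calls(parse_calls(gg_vcf), parse_calls(hc_vcf))
for site, qual in sorted(only_gg.items()):
    print(site, qual)
```

If most of the QUAL values printed for the GGVCFs-only calls sit close to the emit threshold, that would be consistent with the low-confidence explanation above.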
