(BP2.1) Calling Variants with HaplotypeCaller in gVCF mode

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

This article is part of the workflow documentation describing the Best Practices for Variant Discovery in DNAseq data. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

Many variant callers specialize in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) have to call them using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels.

In addition, the HaplotypeCaller is able to estimate the probability that a given site is non-variant. This is very useful when you want to distinguish between cases where no variant was called because the evidence suggests that the site is non-variant, as opposed to cases where no call could be made either way because there was no data available. This capability, conferred by the reference confidence model, is used in the Best Practices workflow to produce a gVCF (short for genomic VCF) for each sample in a cohort.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD


  • vivekdas_1987vivekdas_1987 MilanPosts: 33Member

    Dear @Geraldine_VdAuwera,

    I have some queries which I would like to ask. I have already asked you a lot of question when I started with the GATK but that time i never had control samples so I was actually working with tumor and its corresponding IPS lines. I have the normal samples now. My idea is to match the normal/tumor and normal/IPS and call the variants that are exclusive to the tumor and IPS and then try to see if the tumor and IPS have mutational consistency or not. Having said that I would like to know that if this can be done by merging the tumor normal and normal-IPS together and then take those bam files for realignment and BQSR step and then carry on with the SNP/INDEL calling. Can you give me the suggestion at which step I should do the merging of the files? Also I would shift to Haplotype caller now for calling the variants, as I had been using UnifiedGenotyper for last few times. I want to asses the power of haplotype caller. Can you tell me if I can call the variants using the HC without providing the dbSNP vcf files? What can be the impact if I donot use a dbSNP vcf file when I am calling the variants? I would say everything will be unknown at that time. I will not be getting the rsid for the variants that are there in the dbSNP database right? But my experiment is to check the consistency between the mutational landscape of tumor and its derived IPS. So anyway even if I donot use the dbSNP for variant calling I will get the variants but I will not be able to asses which are novel and which are known right? Correct me if I am wrong. Also if I proceed that way I should get the variants annotated when I use annovar and annovar uses dbSNP to annotate the variants and I can get the rsid at that time for the known variants. If this happens then it would be great. But I would like to know the impact of the variant calling without using dbSNP vcf files. Does that sound logical? I have not found anyone asking this or might have not come across( i tried extensive searching). I would like to know your thoughts on this. Sorry for posing so many questions.



  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

    Hi @vivekdas_1987,

    We don't have any expertise in the cancer field, so we're not the right people to answer this question. I would recommend asking in the MuTect forum since they are experts in cancer/somatic mutations.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 33Member

    Hi @Geraldine_VdAuwera,

    Thank you for the reply but I would like to know is this feasible to call the variants without dbSNP? Then wont it be a problem to distinguish between the novel and the already known variants? This will also be a problem during the prioritizing as I will be a problem to distinguish which are novel mutations right? I would like to know the impact if I donot use dbSNP137.vcf while calling the variants with HC caller.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

    Hi Vivek,

    Yes, you can safely call variants without including a dbsnp file; that will have no effect on the results. You can annotate rsIDs later.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 33Member

    Thank you very much for the reply.

  • Bettina_HarrBettina_Harr Posts: 27Member

    Dear Geraldine,
    I am new to GATK, and have just finished my preprocessing of the bam files. Now I am wandering which genotyping algorithm I should use. I read that the Haplotype caller is superior in many ways, but what it does not seem to do is export all sites, and not only the variable ones. Since I will be doing population genetics on the data, I need to know if a site was truly monomorphic in my population, if it for example had only bad quality bases covering it. Would you have a recommendation for me, how I should proceed? Would it make sense to ran both callers, the unified genotype caller for all sites and the haplotype caller for the variable ones?
    Thanks a lot,

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

    Hi @Bettina_Harr‌,

    You should use the HaplotypeCaller in gVCF/reference model mode; that will give you exactly what you want. See http://www.broadinstitute.org/gatk/blog?id=3896 for details.

    Geraldine Van der Auwera, PhD

  • Bettina_HarrBettina_Harr Posts: 27Member

    brilliant, thanks!

  • andremrsantosandremrsantos Posts: 10Member

    Hi @Geraldine_VdAuwera‌ ,
    I am analyzing a complete genome sequence, how do I deal with haploid chromosomes, such as mitochondrial, X and Y. Do I perform the normal haplotycaller and process separately after with genotypegvcf or I call them using unifiedGenotyper as suggested for haploid genomes?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

    Hi @andremrsantos,

    At the moment our recommendation is to simply process the haploid contigs along with the rest, and filter out any heterozygous calls. We are working on new methods to address the ploidy issue better in an upcoming version.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 207Member ✭✭✭

    @andremrsantos, @Geraldine_VdAuwera; could be just me, but I find that MT always/almost always lands in the fail VQSR tranches for whole genomes no matter whether they were processed with UG or HC. I think the culprit is typically (unsurprisingly) depth. I've never had a reason to go back to them and refilter the calls, but would imagine if necessary that I would use hard filters just for those. I think X and Y work with VQSR "fine" (since they wouldn't have the same characteristics of MT...for every given constitutional DNA in a cell there are tens to thousands of copies of MT), but really haven't looked too closely at either of them.

  • MartaMarta Posts: 11Member

    Hi! I understood that with GVCF mode I could obtain the genotype in all the position of y target. Am I right? I used the follow script:
    java -Xmx5g -jar /path/GATK-3.2-2/GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R $MyRef \
    -D /path/dbsnp_137.b37.excluding_sites_after_129.chr.vcf \
    -I $SampleDir"/"$Sample".srt.RG.localn.qsr.bam" \
    -L $MyTarget \
    -o $OutDir"/"$Sample".g.vcf" \
    -rf BadCigar \
    -stand_call_conf 20 \
    -stand_emit_conf 10 \
    --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
    but in my otuput I have only the variants and not the genotype at all the positions. Can you help me please?
    I'll attach a part of the output

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

    Hi @Marta,

    Do you mean you want calls at every single site, including those that are homozygous-reference in all samples? If so, you need to use --emitRefConfidence BP_RESOLUTION.

    Geraldine Van der Auwera, PhD

  • bhall7bhall7 Posts: 9Member

    Hi there, haven't used GATK in a couple of years and the gVCF mode thing is new to me. We're only working with 6 samples and computational intensity is not an issue -- do you still recommend calling each sample separately, or are we safe to merge them like in the olden days?

    If it's relevant, this is a non-model organism so I'm bootstrapping the "known variants" data. Thanks!

  • 5581681555816815 TNPosts: 12Member

    I have tested several whole genome sequencing data with BAM file size of 80~100GB.
    the gvcf is around 4~5 GB which looks good.

    But today I tried on an whole exome sequence data...the BAM is ~40GB.
    However, the gvcf I got is ~39GB?

    Any comments?


  • SheilaSheila Broad InstitutePosts: 1,405Member, GATK Dev, Broadie, Moderator, DSDE Dev admin


    Sorry for the late response. There shouldn't be any accuracy differences if you use either one. The big difference is that with gVCFs, you don't have scaling issues and you can add samples later on. Also, you can change the genotyping parameters and re-genotype more quickly using GVCFs.


  • SheilaSheila Broad InstitutePosts: 1,405Member, GATK Dev, Broadie, Moderator, DSDE Dev admin
  • 5581681555816815 TNPosts: 12Member

    That make sense....need to learn more about gvcf format.

  • ShreyasiShreyasi USAPosts: 19Member

    Hi ,

    I have completed the pre-processing steps and now for some of my samples, I am in a dilemma. I have two libraries each containing a pool of 3 samples (very, very small sample size) and these two libraries were run on a single lane. I have low coverage (~3-5) for each library. I am trying to figure out what would be the best way to call variants for these two sample pools. Hard-filtering documentation mentioned that it should not be used if my coverage is < ~10x while VQSR documentation suggests it should be used only if I have a large sample size. I understand that my experimental design is quiet flawed here, but I am trying to get as much information as possible from this. Any recommendation would be very helpful at this point. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,027Administrator, GATK Dev admin

    Hi @Shreyasi

    You should call variants on all samples together with HaplotypeCaller, with the -ploidy argument set appropriately. After that, you will have to experiment with the hard filters. Depending on what information you're trying to get, you can be either very strict or very lenient about processing. One thing that can help is to plot the distribution of annotation values, to get a sense of what you're dealing with. If there is any known variation data for your organism (I assume it's non-human?) you can try to look at what are the annotation values for the variants that overlap with the known sites, and adjust the filters accordingly. In any case I'm afraid it's going to take a lot of work to get anything useable out of this. Good luck!

    Geraldine Van der Auwera, PhD

  • ShreyasiShreyasi USAPosts: 19Member

    Thank you @Geraldine for the pointers. Study organism is R.norvegicus for which there is known variation data. Keeping my fingers crossed and hoping for something to come out of this.

Sign In or Register to comment.