(BP2.1) Calling Variants with HaplotypeCaller in gVCF mode

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,820Administrator, GATK Developer admin

This article is part of the workflow documentation describing the Best Practices for Variant Discovery in DNAseq data. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

Many variant callers specialize in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) have to call them using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels.

In addition, the HaplotypeCaller is able to estimate the probability that a given site is non-variant. This is very useful when you want to distinguish between cases where no variant was called because the evidence suggests that the site is non-variant, as opposed to cases where no call could be made either way because there was no data available. This capability, conferred by the reference confidence model, is used in the Best Practices workflow to produce a gVCF (short for genomic VCF) for each sample in a cohort.

image
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • vivekdas_1987vivekdas_1987 MilanPosts: 32Member

    Dear @Geraldine_VdAuwera,

    I have some queries which I would like to ask. I have already asked you a lot of question when I started with the GATK but that time i never had control samples so I was actually working with tumor and its corresponding IPS lines. I have the normal samples now. My idea is to match the normal/tumor and normal/IPS and call the variants that are exclusive to the tumor and IPS and then try to see if the tumor and IPS have mutational consistency or not. Having said that I would like to know that if this can be done by merging the tumor normal and normal-IPS together and then take those bam files for realignment and BQSR step and then carry on with the SNP/INDEL calling. Can you give me the suggestion at which step I should do the merging of the files? Also I would shift to Haplotype caller now for calling the variants, as I had been using UnifiedGenotyper for last few times. I want to asses the power of haplotype caller. Can you tell me if I can call the variants using the HC without providing the dbSNP vcf files? What can be the impact if I donot use a dbSNP vcf file when I am calling the variants? I would say everything will be unknown at that time. I will not be getting the rsid for the variants that are there in the dbSNP database right? But my experiment is to check the consistency between the mutational landscape of tumor and its derived IPS. So anyway even if I donot use the dbSNP for variant calling I will get the variants but I will not be able to asses which are novel and which are known right? Correct me if I am wrong. Also if I proceed that way I should get the variants annotated when I use annovar and annovar uses dbSNP to annotate the variants and I can get the rsid at that time for the known variants. If this happens then it would be great. But I would like to know the impact of the variant calling without using dbSNP vcf files. Does that sound logical? I have not found anyone asking this or might have not come across( i tried extensive searching). I would like to know your thoughts on this. Sorry for posing so many questions.

    Regards,

    Vivek.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,820Administrator, GATK Developer admin

    Hi @vivekdas_1987,

    We don't have any expertise in the cancer field, so we're not the right people to answer this question. I would recommend asking in the MuTect forum since they are experts in cancer/somatic mutations.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 32Member

    Hi @Geraldine_VdAuwera,

    Thank you for the reply but I would like to know is this feasible to call the variants without dbSNP? Then wont it be a problem to distinguish between the novel and the already known variants? This will also be a problem during the prioritizing as I will be a problem to distinguish which are novel mutations right? I would like to know the impact if I donot use dbSNP137.vcf while calling the variants with HC caller.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,820Administrator, GATK Developer admin

    Hi Vivek,

    Yes, you can safely call variants without including a dbsnp file; that will have no effect on the results. You can annotate rsIDs later.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 32Member
  • Bettina_HarrBettina_Harr Posts: 22Member

    Dear Geraldine, I am new to GATK, and have just finished my preprocessing of the bam files. Now I am wandering which genotyping algorithm I should use. I read that the Haplotype caller is superior in many ways, but what it does not seem to do is export all sites, and not only the variable ones. Since I will be doing population genetics on the data, I need to know if a site was truly monomorphic in my population, if it for example had only bad quality bases covering it. Would you have a recommendation for me, how I should proceed? Would it make sense to ran both callers, the unified genotype caller for all sites and the haplotype caller for the variable ones? Thanks a lot, Tina

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,820Administrator, GATK Developer admin

    Hi @Bettina_Harr‌,

    You should use the HaplotypeCaller in gVCF/reference model mode; that will give you exactly what you want. See http://www.broadinstitute.org/gatk/blog?id=3896 for details.

    Geraldine Van der Auwera, PhD

  • andremrsantosandremrsantos Posts: 10Member

    Hi @Geraldine_VdAuwera‌ , I am analyzing a complete genome sequence, how do I deal with haploid chromosomes, such as mitochondrial, X and Y. Do I perform the normal haplotycaller and process separately after with genotypegvcf or I call them using unifiedGenotyper as suggested for haploid genomes?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,820Administrator, GATK Developer admin

    Hi @andremrsantos,

    At the moment our recommendation is to simply process the haploid contigs along with the rest, and filter out any heterozygous calls. We are working on new methods to address the ploidy issue better in an upcoming version.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 174Member ✭✭✭

    @andremrsantos, @Geraldine_VdAuwera; could be just me, but I find that MT always/almost always lands in the fail VQSR tranches for whole genomes no matter whether they were processed with UG or HC. I think the culprit is typically (unsurprisingly) depth. I've never had a reason to go back to them and refilter the calls, but would imagine if necessary that I would use hard filters just for those. I think X and Y work with VQSR "fine" (since they wouldn't have the same characteristics of MT...for every given constitutional DNA in a cell there are tens to thousands of copies of MT), but really haven't looked too closely at either of them.

  • MartaMarta Posts: 11Member

    Hi! I understood that with GVCF mode I could obtain the genotype in all the position of y target. Am I right? I used the follow script: java -Xmx5g -jar /path/GATK-3.2-2/GenomeAnalysisTK.jar -T HaplotypeCaller \ -R $MyRef \ -D /path/dbsnp_137.b37.excluding_sites_after_129.chr.vcf \ -I $SampleDir"/"$Sample".srt.RG.localn.qsr.bam" \ -L $MyTarget \ -o $OutDir"/"$Sample".g.vcf" \ -rf BadCigar \ -stand_call_conf 20 \ -stand_emit_conf 10 \ --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 but in my otuput I have only the variants and not the genotype at all the positions. Can you help me please? I'll attach a part of the output

    Thank you

    xlsx
    xlsx
    HaplotCall.xlsx
    7K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,820Administrator, GATK Developer admin

    Hi @Marta,

    Do you mean you want calls at every single site, including those that are homozygous-reference in all samples? If so, you need to use --emitRefConfidence BP_RESOLUTION.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.