Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

(BP2.1) Calling Variants with HaplotypeCaller in gVCF mode

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

This article is part of the Best Practices workflow documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

Many variant callers specialize in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) have to call them using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels.

In addition, the HaplotypeCaller is able to estimate the probability that a given site is non-variant. This is very useful when you want to distinguish between cases where no variant was called because the evidence suggests that the site is non-variant, as opposed to cases where no call could be made either way because there was no data available. This capability, conferred by the reference confidence model, is used in the Best Practices workflow to produce a gVCF (short for genomic VCF) for each sample in a cohort.

image
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • vivekdas_1987vivekdas_1987 MilanPosts: 20Member

    Dear @Geraldine_VdAuwera,

    I have some queries which I would like to ask. I have already asked you a lot of question when I started with the GATK but that time i never had control samples so I was actually working with tumor and its corresponding IPS lines. I have the normal samples now. My idea is to match the normal/tumor and normal/IPS and call the variants that are exclusive to the tumor and IPS and then try to see if the tumor and IPS have mutational consistency or not. Having said that I would like to know that if this can be done by merging the tumor normal and normal-IPS together and then take those bam files for realignment and BQSR step and then carry on with the SNP/INDEL calling. Can you give me the suggestion at which step I should do the merging of the files? Also I would shift to Haplotype caller now for calling the variants, as I had been using UnifiedGenotyper for last few times. I want to asses the power of haplotype caller. Can you tell me if I can call the variants using the HC without providing the dbSNP vcf files? What can be the impact if I donot use a dbSNP vcf file when I am calling the variants? I would say everything will be unknown at that time. I will not be getting the rsid for the variants that are there in the dbSNP database right? But my experiment is to check the consistency between the mutational landscape of tumor and its derived IPS. So anyway even if I donot use the dbSNP for variant calling I will get the variants but I will not be able to asses which are novel and which are known right? Correct me if I am wrong. Also if I proceed that way I should get the variants annotated when I use annovar and annovar uses dbSNP to annotate the variants and I can get the rsid at that time for the known variants. If this happens then it would be great. But I would like to know the impact of the variant calling without using dbSNP vcf files. Does that sound logical? I have not found anyone asking this or might have not come across( i tried extensive searching). I would like to know your thoughts on this. Sorry for posing so many questions.

    Regards,

    Vivek.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Hi @vivekdas_1987,

    We don't have any expertise in the cancer field, so we're not the right people to answer this question. I would recommend asking in the MuTect forum since they are experts in cancer/somatic mutations.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 20Member

    Hi @Geraldine_VdAuwera,

    Thank you for the reply but I would like to know is this feasible to call the variants without dbSNP? Then wont it be a problem to distinguish between the novel and the already known variants? This will also be a problem during the prioritizing as I will be a problem to distinguish which are novel mutations right? I would like to know the impact if I donot use dbSNP137.vcf while calling the variants with HC caller.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Hi Vivek,

    Yes, you can safely call variants without including a dbsnp file; that will have no effect on the results. You can annotate rsIDs later.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987vivekdas_1987 MilanPosts: 20Member

    Thank you very much for the reply.

  • Bettina_HarrBettina_Harr Posts: 7Member

    Dear Geraldine, I am new to GATK, and have just finished my preprocessing of the bam files. Now I am wandering which genotyping algorithm I should use. I read that the Haplotype caller is superior in many ways, but what it does not seem to do is export all sites, and not only the variable ones. Since I will be doing population genetics on the data, I need to know if a site was truly monomorphic in my population, if it for example had only bad quality bases covering it. Would you have a recommendation for me, how I should proceed? Would it make sense to ran both callers, the unified genotype caller for all sites and the haplotype caller for the variable ones? Thanks a lot, Tina

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,213Administrator, GSA Member admin

    Hi @Bettina_Harr‌,

    You should use the HaplotypeCaller in gVCF/reference model mode; that will give you exactly what you want. See http://www.broadinstitute.org/gatk/blog?id=3896 for details.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.