
combine snp and indel vcf

biolex (Vienna; Member)

Is there any way to combine a SNP VCF and an indel VCF (both generated with the UnifiedGenotyper) afterwards, so that there is only one row per locus?

Regardless of how I combine them (I mainly tried CombineVariants), if the two VCF files contain different calls at the same locus, the combined file has two rows for it. I would like these written as alternative alleles on a single row for that locus.


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    What you're trying to do is not possible; SNPs and Indels should always be on different lines since they are different events, even if they start at the same position.

  • biolex (Vienna; Member)

    Shouldn't they be on the same row if they are mutually exclusive?
    (OK, this never happens with insertions, and never with deletions at the same coordinate, because a deletion starts at the next base, but a deletion could be mutually exclusive with a SNP.)

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    No, this is not a question of whether or not they are mutually exclusive; the problem is that they are different kinds of variation. They are like cubes and spheres: you can't store them together on the same shelf.

  • ebanks (Broad Institute; Member, Broadie, Dev ✭✭✭✭)

    Just to correct the answer slightly:
    SNPs and indels certainly can be merged together into a single record in theory (in fact, CombineVariants used to do this). It is allowed by the VCF specification and not unreasonable.
    However, it is technically very complicated to get it right. In particular, complex substitutions are virtually impossible to merge correctly. So instead of spending tons of hours trying to fix the process that always seemed to be broken, we just decided not to allow merging of different variant types anymore. It's a less flexible choice, but at least the results are not wrong.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Oops, my bad for being excessively categorical -- thanks for the correction, Eric.

  • AdminTim (London; Member)
    edited October 2013

    Hi Folks,

    To follow on from this question - please could you further explain the representation of SNPs and indels at the same locus?

    For example, for a single diploid sample at a single locus where one chromosome copy has a SNP and the other has an indel, how does VCF represent this in separate rows without giving two genotypes or four haplotypes at that locus? Can a single VCF row call only one haplotype? Is that the difference between GT=./1 (haplo) and GT=0/1 (geno)? So would my example be two rows, both with GT ./1, assuming the SNP and the indel are both the first alternate allele in their rows?

    In the case of HaplotypeCaller, since it tries to reassemble the haplotypes and calls both indels and SNPs at the same time, are mixed SNP/indel loci better represented than with UnifiedGenotyper? I was hoping this is the case and a major reason for preferring HaplotypeCaller.

    This question is particularly perplexing when it comes to using haplotype phasing software - what does e.g. Beagle do for the above loci?

    Many thanks,


  • AdminTim (London; Member)
    edited October 2013

    The VCF spec outlines a single-row deletion/SNP case like this:

    Suppose I see the following in a population of individuals and want to represent these three segregating alleles:
    Ref: a t C g a // C is the reference base
       : a t G g a // C base is a G in some individuals
       : a t - g a // C base is deleted w.r.t. the reference
    How do I represent this? There are three segregating alleles: { tC , tG , t } with a corresponding VCF record:
    20     2 .         TC      TG,T    .   PASS  DP=100

    It also says about the GT field:

    If a call cannot be made for a sample at a given locus, "." should be specified for each missing allele in the GT field (for example "./." for a diploid genotype and "." for haploid genotype).

    So do the GATK callers use this notation to call the separate haplotypes in separate rows for overlapping indel/snp?
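
To make the two spec excerpts above concrete, here is a minimal Python sketch that reads the quoted record, labels each ALT allele by comparing it with REF, and decodes a few GT strings into alleles. The function names and the example GT strings are hypothetical (the quoted record has no sample columns), and the classification by length is a simplification, not how any GATK tool does it.

```python
import re


def classify_alt(ref, alt):
    """Label an ALT allele relative to REF using allele lengths only (simplified)."""
    if alt == ref:
        return "ref"
    if len(alt) == len(ref):
        mismatches = sum(r != a for r, a in zip(ref, alt))
        return "SNP" if mismatches == 1 else "MNP"
    return "deletion" if len(alt) < len(ref) else "insertion"


def decode_gt(gt, ref, alts):
    """Translate a GT string into allele sequences; '.' (missing allele) becomes None."""
    alleles = [ref] + alts  # index 0 is REF, 1..n are the ALT alleles in order
    return [None if i == "." else alleles[int(i)] for i in re.split(r"[/|]", gt)]


# The record quoted from the VCF spec (columns: CHROM POS ID REF ALT QUAL FILTER INFO).
fields = "20     2 .         TC      TG,T    .   PASS  DP=100".split()
ref, alts = fields[3], fields[4].split(",")

kinds = {alt: classify_alt(ref, alt) for alt in alts}
print(kinds)  # {'TG': 'SNP', 'T': 'deletion'}

# Hypothetical genotypes for illustration only.
for gt in ("0/1", "1/2", "./1", "./."):
    print(gt, decode_gt(gt, ref, alts))
```

Read this way, 1/2 is the genotype that carries the SNP allele on one chromosome copy and the deletion on the other, while ./1 only says that one of the two alleles could not be called.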



  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    UnifiedGenotyper will emit separate VCF records for the SNP and the indel case because they are processed through different models that are not aware of each other. HaplotypeCaller uses a single model, so it should emit them together using the notation you excerpted from the VCF spec. I'm checking with the devs to make sure; stay tuned.

  • rpoplin (Member ✭✭✭)

    Yes, that is exactly right. For the UnifiedGenotyper, SNPs and indels are considered independently. So if the truth for the sample is that one chromosome contains the SNP while the other contains the indel at the same locus, then the variation will be represented as two biallelic, heterozygous (0/1) records at that locus.

    The HaplotypeCaller considers SNPs and indels simultaneously and so it would figure out that the two haplotypes which best represent the data are that one haplotype has the SNP allele while the other haplotype has the indel allele. The resulting variation call would be a single multi-allelic record with the genotype as 1/2.

    I hope that answers the questions. Let me know if you need any more information.
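
As an illustration of the difference between the two representations, the following Python sketch turns two overlapping biallelic 0/1 records into one multi-allelic 1/2 record. It is not what CombineVariants or HaplotypeCaller does internally: `Call` and `merge_het_calls` are hypothetical names, and the sketch simply assumes that the two heterozygous calls sit on opposite chromosome copies, which cannot be verified from the two records alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    pos: int          # 1-based position of the first REF base
    ref: str
    alts: List[str]
    gt: str


def merge_het_calls(a: Call, b: Call) -> Call:
    """Merge two overlapping biallelic 0/1 calls into one record with GT 1/2.

    ASSUMPTION: the two ALT alleles are on different haplotypes.
    """
    if a.gt != "0/1" or b.gt != "0/1" or len(a.alts) != 1 or len(b.alts) != 1:
        raise ValueError("this sketch only handles two biallelic 0/1 records")

    start = min(a.pos, b.pos)
    end = max(a.pos + len(a.ref), b.pos + len(b.ref))  # exclusive

    # Rebuild the REF allele spanning both records from their own REF bases.
    merged = [None] * (end - start)
    for call in (a, b):
        for offset, base in enumerate(call.ref):
            i = call.pos - start + offset
            if merged[i] not in (None, base):
                raise ValueError("REF alleles disagree")
            merged[i] = base
    if None in merged:
        raise ValueError("records do not overlap; bases between them are unknown")
    ref = "".join(merged)

    def pad(call: Call) -> str:
        # Re-express the ALT allele against the wider, merged REF allele.
        left = call.pos - start
        return ref[:left] + call.alts[0] + ref[left + len(call.ref):]

    return Call(start, ref, [pad(a), pad(b)], "1/2")


snp = Call(pos=3, ref="C", alts=["G"], gt="0/1")        # SNP row
deletion = Call(pos=2, ref="TC", alts=["T"], gt="0/1")  # deletion row
print(merge_het_calls(snp, deletion))
```

Even in this tiny case the SNP has to be rewritten from C>G at position 3 to TC>TG at position 2 before the two rows can share one record, which hints at why merging different variant types in general is hard to get right.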


  • AdminTim (London; Member)


    Thanks for your reply (I didn't get a notification, so I've only just seen it). This is great; that's the behaviour I was hoping for. I'm going to give it a shot and I'll post if I find cases that fall outside this expectation.

    Many thanks for your support,

