Combine SNP and indel VCF

biolex (Vienna), Member, Posts: 2

Is there any way to combine a SNP VCF and an indel VCF (both generated with UnifiedGenotyper) after the fact, so that there is only one row per locus?

Regardless of how I combine them (I mainly tried CombineVariants), if the two VCF files contain different calls at the same locus, the combined file has two rows for it. I would like these written as alternate alleles on a single row for that locus.


  • Geraldine_VdAuwera, Administrator, Dev, Posts: 11,163

    What you're trying to do is not possible; SNPs and Indels should always be on different lines since they are different events, even if they start at the same position.

    Geraldine Van der Auwera, PhD

  • biolex (Vienna), Member, Posts: 2

    Shouldn't they be on the same row if they are mutually exclusive?
    (OK, this never happens with insertions, and never with deletions at the same coordinate, because deletions start at the next base, but a deletion could be mutually exclusive with a SNP.)

  • Geraldine_VdAuwera, Administrator, Dev, Posts: 11,163

    No, this is not a question of whether or not they are mutually exclusive; the problem is that they are different kinds of variation. They are like cubes and spheres: you can't store them together on the same shelf.

    Geraldine Van der Auwera, PhD

  • ebanks (Broad Institute), Member, Administrator, Broadie, Moderator, Dev, Posts: 692

    Just to correct the answer slightly:
    SNPs and indels certainly can be merged together into a single record in theory (in fact, CombineVariants used to do this). This is allowed by the VCF specification and is not unreasonable.
    However, it is technically very complicated to get it right. In particular, complex substitutions are virtually impossible to merge correctly. So instead of spending tons of hours trying to fix the process that always seemed to be broken, we just decided not to allow merging of different variant types anymore. It's a less flexible choice, but at least the results are not wrong.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • Geraldine_VdAuwera, Administrator, Dev, Posts: 11,163

    Oops, my bad for being excessively categorical -- thanks for the correction, Eric.

    Geraldine Van der Auwera, PhD

  • AdminTim (London), Member, Posts: 19
    edited October 2013

    Hi Folks,

    To follow on from this question: could you please explain further how SNPs and indels at the same locus are represented?

    For example, for a single diploid sample at a single locus where one chromosome copy has a SNP and the other has an indel, how does VCF represent this in separate rows without giving two genotypes or four haplotypes at that locus? Can a single VCF row call only one haplotype, and is that the difference between GT=./1 (haplotype) and GT=0/1 (genotype)? If so, would my example be two rows, both with GT ./1, assuming the SNP and the indel are each the first alternate allele?

    In the case of HaplotypeCaller, since it reassembles the haplotypes and calls indels and SNPs at the same time, are mixed SNP/indel loci represented better than with UnifiedGenotyper? I was hoping this is the case, and that it is a major reason for preferring HaplotypeCaller.

    This question is particularly perplexing when it comes to haplotype phasing software: what does Beagle, for example, do with the loci above?

    Many thanks,


  • AdminTim (London), Member, Posts: 19
    edited October 2013

    The VCF spec describes a single-row deletion/SNP case like this:

    Suppose I see the following in a population of individuals and want to represent these three segregating alleles:
    Ref: a t C g a // C is the reference base
       : a t G g a // C base is a G in some individuals
       : a t - g a // C base is deleted w.r.t. the
    How do I represent this? There are three segregating alleles: { tC , tG , t } with a corresponding VCF record:
    20     2 .         TC      TG,T    .   PASS  DP=100
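As a rough illustration of how the single-row record above can arise, the following Python sketch re-expresses a SNP record and a deletion record against one shared REF allele. It is not how CombineVariants or any GATK tool works: `merge_records` and the `(pos, ref, alt)` tuple layout are hypothetical names made up for this example, the reference string assumes the five bases of the excerpt sit at positions 1 to 5, and the sketch ignores genotypes, annotations and complex substitutions.

```python
def merge_records(reference, rec_a, rec_b):
    """Merge two biallelic (pos, ref, alt) records into one multi-allelic record.

    `reference` is the reference sequence as a string; positions are 1-based.
    Hypothetical helper for illustration only.
    """
    records = (rec_a, rec_b)
    # The merged record spans from the leftmost start to the rightmost end.
    start = min(pos for pos, _, _ in records)
    end = max(pos + len(ref) - 1 for pos, ref, _ in records)
    merged_ref = reference[start - 1:end]

    merged_alts = []
    for pos, ref, alt in records:
        # Each record's REF must agree with the reference sequence.
        if reference[pos - 1:pos - 1 + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match reference at {pos}")
        # Pad the ALT with the reference bases it does not cover itself.
        prefix = reference[start - 1:pos - 1]
        suffix = reference[pos - 1 + len(ref):end]
        padded_alt = prefix + alt + suffix
        if padded_alt not in merged_alts:
            merged_alts.append(padded_alt)
    return start, merged_ref, merged_alts


reference = "ATCGA"        # assumed: the excerpt's bases at positions 1..5
snp = (3, "C", "G")        # C base is a G
deletion = (2, "TC", "T")  # C base is deleted, anchored on the preceding T

print(merge_records(reference, snp, deletion))  # (2, 'TC', ['TG', 'T'])
```

Padding both alternates to the same REF is the easy part; the sketch says nothing about reconciling per-sample genotypes or overlapping complex events.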

    It also says about the GT field:

    If a call cannot be made for a sample at a given locus, "." should be specified for each missing allele in the GT field (for example "./." for a diploid genotype and "." for haploid genotype).

    So do the GATK callers use this notation to call the separate haplotypes in separate rows for overlapping indels and SNPs?



  • Geraldine_VdAuwera, Administrator, Dev, Posts: 11,163

    UnifiedGenotyper will emit separate VCF records for the SNP and the indel case because they are processed through different models that are not aware of each other. HaplotypeCaller uses a single model, so it should emit them together using the notation you excerpted from the VCF spec. I'm checking with the devs to make sure, stay tuned.

    Geraldine Van der Auwera, PhD

  • rpoplin, Dev, Posts: 122

    Yes, that is exactly right. For the UnifiedGenotyper, SNPs and indels would be considered independently. So if the truth for the sample is that one chromosome contains the SNP while the other contains the indel at the same locus, then the variation will be represented as two biallelic, heterozygous (0/1) records at that locus.

    The HaplotypeCaller considers SNPs and indels simultaneously and so it would figure out that the two haplotypes which best represent the data are that one haplotype has the SNP allele while the other haplotype has the indel allele. The resulting variation call would be a single multi-allelic record with the genotype as 1/2.
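To make the two representations concrete, here is a small Python sketch that decodes a GT string into allele sequences. The helper `decode_genotype` is hypothetical, and the alleles are borrowed from the VCF spec excerpt quoted earlier in the thread purely as an assumed example; none of this is actual output from either caller.

```python
def decode_genotype(ref, alts, gt):
    """Translate a GT string such as "0/1" or "1|2" into allele sequences.

    Index 0 is REF and index k >= 1 is the k-th ALT; "." (no call) becomes None.
    Hypothetical helper for illustration only.
    """
    alleles = [ref] + list(alts)
    separator = "|" if "|" in gt else "/"
    return [None if i == "." else alleles[int(i)] for i in gt.split(separator)]


# UnifiedGenotyper-style: two independent biallelic records, each called 0/1.
ug_snp = decode_genotype("C", ["G"], "0/1")
ug_del = decode_genotype("TC", ["T"], "0/1")

# HaplotypeCaller-style: one multi-allelic record called 1/2.
hc_call = decode_genotype("TC", ["TG", "T"], "1/2")

print(ug_snp)   # ['C', 'G']
print(ug_del)   # ['TC', 'T']
print(hc_call)  # ['TG', 'T']
```

Read this way, each of the two separate rows still lists the reference allele for one chromosome, whereas the single 1/2 record states directly that one haplotype carries the SNP and the other carries the deletion.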

    I hope that answers the questions. Let me know if you need any more information.


  • AdminTim (London), Member, Posts: 19


    Thanks for your reply (I didn't get a notification, so I have only just seen it). This is great; that's the behaviour I was hoping for. I'm going to give it a shot, and I'll post if I find cases that fall outside this expectation.

    Many thanks for your support,

