The current GATK version is 3.3-0

# combine snp and indel vcf

Vienna · Posts: 2 · Member

Is there any way to combine a SNP VCF and an indel VCF (both generated with the UnifiedGenotyper) after the fact, so that there is only one row per locus?

Regardless of how I combine them (I mainly tried CombineVariants), if the two VCF files contain different calls at the same locus, the combined file has two rows for that locus. I would like these written as alternate alleles on a single row.

What you're trying to do is not possible; SNPs and Indels should always be on different lines since they are different events, even if they start at the same position.

Geraldine Van der Auwera, PhD

• Vienna · Posts: 2 · Member

Shouldn't they be on the same row if they are mutually exclusive? (OK, this never happens with insertions, and never with deletions at the same coordinate, because deletions start at the next base, but a deletion could be mutually exclusive with a SNP.)

No, this is not a question of whether or not they are mutually exclusive; the problem is that they are different kinds of variation. They are like cubes and spheres: you can't store them together on the same shelf.

Geraldine Van der Auwera, PhD

• Posts: 683 · GATK Developer · mod

Just to correct the answer slightly: SNPs and indels certainly can be merged together into a single record in theory (in fact, CombineVariants used to do this). It is allowed by the VCF specification and not unreasonable. However, it is technically very complicated to get it right. In particular, complex substitutions are virtually impossible to merge correctly. So instead of spending tons of hours trying to fix a process that always seemed to be broken, we just decided not to allow merging of different variant types anymore. It's a less flexible choice, but at least the results are not wrong.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Oops, my bad for being excessively categorical -- thanks for the correction, Eric.

Geraldine Van der Auwera, PhD

• London · Posts: 19 · Member
edited October 2013

Hi Folks,

To follow on from this question - please could you further explain the representation of SNPs and indels at the same locus?

For example, for a single diploid sample at a single locus where one chromosome copy has a SNP and the other has an indel, how does VCF represent this in separate rows without giving two genotypes or four haplotypes at that locus? Can a single VCF row call only one haplotype? Is that the difference between GT=./1 (haplo) and GT=0/1 (geno)? If so, would my example be two rows, both with GT ./1, assuming the SNP and the indel are each the first alternate allele in their row?

In the case of HaplotypeCaller, since it tries to reassemble the haplotypes and calls both indels and SNPs at the same time, are mixed SNP/indel loci better represented than with UnifiedGenotyper? I was hoping this is the case and a major reason for preferring HaplotypeCaller.

This question is particularly perplexing when it comes to using haplotype phasing software - what does e.g. Beagle do for the above loci?

Many thanks,

Tim

• London · Posts: 19 · Member
edited October 2013

The VCF spec outlines a single-row deletion/SNP case like this:

Suppose I see the following in a population of individuals and want to represent these three segregating alleles:

Ref: a t C g a // C is the reference base
   : a t G g a // C base is a G in some individuals
   : a t - g a // C base is deleted w.r.t. the reference

How do I represent this? There are three segregating alleles: { tC , tG , t } with a corresponding VCF record:

20     2 .         TC      TG,T    .   PASS  DP=100
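
The padding step behind that record can be sketched in a few lines of Python. This is only an illustration of the simple case in the spec excerpt, not what CombineVariants or any GATK tool does; the function name `pad_to_common_ref` and the `(pos, ref, alt)` tuple layout are made up for this example, and it ignores genotypes, annotations, and the complex substitutions that make real merging hard.

```python
def pad_to_common_ref(records):
    """Merge overlapping biallelic records into one multi-allelic record.

    records: list of (pos, ref, alt) tuples, 1-based positions, same contig.
    Returns (pos, ref, alts). Hypothetical helper, simple cases only.
    """
    start = min(pos for pos, _, _ in records)
    end = max(pos + len(ref) for pos, ref, _ in records)  # exclusive

    # Rebuild the reference span from the REF strings of the records.
    span = [None] * (end - start)
    for pos, ref, _ in records:
        for i, base in enumerate(ref):
            idx = pos - start + i
            if span[idx] not in (None, base):
                raise ValueError("records disagree on the reference")
            span[idx] = base
    if None in span:
        raise ValueError("records do not cover a contiguous span")
    common_ref = "".join(span)

    # Re-express every ALT against the common REF by adding flanking bases.
    alts = []
    for pos, ref, alt in records:
        left = common_ref[: pos - start]
        right = common_ref[pos - start + len(ref):]
        padded = left + alt + right
        if padded not in alts:
            alts.append(padded)
    return start, common_ref, alts


# SNP C>G at position 3 plus deletion TC>T at position 2, as in the excerpt.
print(pad_to_common_ref([(3, "C", "G"), (2, "TC", "T")]))
# (2, 'TC', ['TG', 'T'])
```

Note that the SNP and the deletion do not even start at the same coordinate until the SNP is padded with the preceding base, which is part of why the merged representation is easy to get wrong.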


It also says about the GT field:

If a call cannot be made for a sample at a given locus, "." should be specified for each missing allele in the GT field (for example "./." for a diploid genotype and "." for a haploid genotype).
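
To make the notation concrete, here is a minimal Python sketch of reading a GT string, treating "." as a missing allele as the spec describes. The helper name `parse_gt` is hypothetical; real VCF libraries expose their own genotype accessors.

```python
import re


def parse_gt(gt):
    """Split a GT string into allele indices and a phased flag.

    A "." allele is returned as None. Hypothetical helper for illustration.
    """
    phased = "|" in gt
    alleles = tuple(None if a == "." else int(a) for a in re.split(r"[/|]", gt))
    return alleles, phased


for gt in ["0/1", "1/2", "./.", ".", "./1", "0|1"]:
    print(gt, parse_gt(gt))
```

Under this reading, "./1" means one allele is the first ALT and the other is uncalled, whereas "0/1" asserts that the other allele is the reference.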


So do the GATK callers use this notation to call the separate haplotypes in separate rows for an overlapping indel and SNP?

Thanks

Tim

UnifiedGenotyper will emit separate VCF records for the SNP and the indel case because they are processed through different models that are not aware of each other. HaplotypeCaller uses a single model, so it should emit them together using the notation you excerpted from the VCF spec. I'm checking with the devs to make sure, stay tuned.

Geraldine Van der Auwera, PhD

• Posts: 122 · GATK Developer · mod

Yes, that is exactly right. For the UnifiedGenotyper, SNPs and indels are considered independently. So if the truth for the sample is that one chromosome contains the SNP while the other contains the indel at the same locus, then the variation will be represented as two biallelic, heterozygous (0/1) records at that locus.

The HaplotypeCaller considers SNPs and indels simultaneously and so it would figure out that the two haplotypes which best represent the data are that one haplotype has the SNP allele while the other haplotype has the indel allele. The resulting variation call would be a single multi-allelic record with the genotype as 1/2.
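
The difference between the two representations can be sketched in Python by resolving each genotype to allele strings. The records below reuse the alleles from the VCF spec excerpt quoted earlier in the thread rather than real caller output, and `called_alleles` is a hypothetical helper that assumes every allele in the GT is called (no ".").

```python
def called_alleles(ref, alts, gt):
    """Return the allele strings a GT points at (index 0 is REF)."""
    alleles = [ref] + alts
    return [alleles[int(i)] for i in gt.replace("|", "/").split("/")]


# UnifiedGenotyper style: two biallelic records, each heterozygous 0/1.
ug_snp = called_alleles("C", ["G"], "0/1")    # ['C', 'G']
ug_del = called_alleles("TC", ["T"], "0/1")   # ['TC', 'T']

# HaplotypeCaller style: one multi-allelic record with genotype 1/2.
hc = called_alleles("TC", ["TG", "T"], "1/2")  # ['TG', 'T']

print(ug_snp, ug_del, hc)
```

Each 0/1 record, read on its own, claims one reference allele, so the two-record form needs both rows to be interpreted together; the 1/2 record states directly that neither haplotype carries the reference.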

Cheers,

• London · Posts: 19 · Member

Hi,

Thanks for your reply (I didn't get a notification, so I've only just seen it). This is great; that's the behaviour I was hoping for. I'm going to give it a shot, and I'll post if I find cases that fall outside this expectation.