Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

UnifiedGenotyper: different -glm value result in different sets of variants?

ihleeihlee Posts: 2Member

Hi, I'm running UnifiedGenotyper with different glm values (BOTH, INDEL, SNP). I expected that the set of variants from glm BOTH is the same as the union of variants from glm SNP and INDEL, but it wasn't. Althought the different was not big (less than 100 variants), I'm curious why there's such a difference and want to know which is better way to find variants (both snps & indels). Thank you.

Best Answer

  • ebanksebanks Posts: 671 mod
    Answer ✓

    The GATK uses randomization for down-sampling high coverage sites. It does, however, use the same initial seed for the randomization (so that's why it is deterministic when running single threaded). Because the genotyper is triggering at a different set of sites between SNP and BOTH, it affects the random numbers used at any given position.

    BOTH mode does not do joint SNP and indel calling, but rather is just a way for users to save IO by only having to read the data once instead of twice. If you want joint calling of all variant types then you should check out the Haplotype Caller which does this.


  • vanheel123vanheel123 Posts: 5Member

    I'd also be interested in the comments from GATK team on this, please.

    Also - apologies as I havent yet tried it - but what would happen at a site with a typical REF/ALT biallelic SNP, but also a single base indel - would this be outputted as a triallelic variant in the vcf with UnifiedGenotype -glm BOTH ?

    regards, david van heel

  • vanheel123vanheel123 Posts: 5Member

    Dear Eric,

    How does HaplotypeCaller perform on single samples? I have very high depth data and UnifiedGenotyper gives great SNP calling results on single samples at a time. Such that I have not even tried multi-sample calling, given issues of batches, memory, etc (I have thousands of samples). As posted above, I am keen to have joint calling of both SNPs and INDELs, with output of reference homozygote, non-ref het and non-ref hom genotypes. I presume "HaplotypeCaller --output_mode EMIT_ALL_CONFIDENT_SITES" will do this.

    Sorry - couldnt find that much description of algorithms behind HaplotypeCaller on searching - may have missed it and would appreciate being pointed in right direction if so!

    regards, david

  • vanheel123vanheel123 Posts: 5Member

    Please ignore the above post - I think I have to use multi-sample calling to get what I want with indels (or feed a single sample caller a list of positions/alleles to call). david

Sign In or Register to comment.