Meaning of 'unified' in UnifiedGenotyper
The technical documentation for UnifiedGenotyper states:
The GATK Unified Genotyper is a multiple-sample, technology-aware SNP and indel caller. It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting an accurate posterior probability of there being a segregating variant allele at each locus as well as for the genotype of each sample.
When run with multiple -I inputs, in what ways does the UG algorithm use the multiple samples? Does it merely assess each locus in each sample in isolation, based on the pileup there (and possibly a window around it), or does it somehow take the other samples into account? Does running it with multiple samples enable it to find variants in some samples that it would not find if run on each sample individually?
I realize that running in multi-sample mode produces a convenient multi-column VCF with integrated sample information. But is this anything more than a formatting convenience? In other words, is it the same result you would get from the following procedure:
1. Run UG N times, each time with just a single sample.
2. Collect the union of all loci that have a variant in at least one sample.
3. Re-run UG on each of the N individual samples, asking it to output calls for the union of loci from step 2.
4. Merge the results into a single, multi-column VCF file.
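
To make the procedure above concrete, here is a minimal Python sketch of the bookkeeping only. It does not invoke GATK; the sample names, loci, and genotypes are made-up illustration data, and `lookup` is a hypothetical stand-in for the per-sample re-run at the union loci (here it simply assumes "0/0" where the first pass reported nothing).

```python
from typing import Callable, Dict, List, Set, Tuple

Locus = Tuple[str, int]    # (chromosome, 1-based position)
Calls = Dict[Locus, str]   # locus -> genotype string, e.g. "0/1"


def union_of_loci(first_pass: Dict[str, Calls]) -> List[Locus]:
    """Collect every locus that is variant in at least one sample."""
    loci: Set[Locus] = set()
    for calls in first_pass.values():
        loci.update(calls)
    return sorted(loci)


def merge_at_loci(
    samples: List[str],
    loci: List[Locus],
    genotype_at: Callable[[str, Locus], str],
) -> Dict[Locus, List[str]]:
    """Genotype each sample at each union locus; one row per locus."""
    return {locus: [genotype_at(s, locus) for s in samples] for locus in loci}


# Made-up first-pass results: variant sites found in each single-sample run.
first_pass: Dict[str, Calls] = {
    "sampleA": {("chr1", 100): "0/1", ("chr1", 250): "1/1"},
    "sampleB": {("chr1", 250): "0/1", ("chr2", 40): "0/1"},
}


def lookup(sample: str, locus: Locus) -> str:
    # Hypothetical placeholder for re-running the caller on one sample
    # at a fixed locus; assumes hom-ref where nothing was called before.
    return first_pass[sample].get(locus, "0/0")


samples = sorted(first_pass)
loci = union_of_loci(first_pass)
table = merge_at_loci(samples, loci, lookup)

for (chrom, pos), genotypes in table.items():
    print(chrom, pos, *genotypes, sep="\t")
```

Note that in this sketch each sample's genotype is determined without looking at any other sample, which is exactly the property the question is asking about.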