Extracting consensus variants from a VCF with 27 RNA-seq samples from the same genotype

Is there a tool or recommended best practice for generating a consensus set of variants from multiple samples of the same genotype? In short, I have 27 RNA libraries from different individuals, different tissues, and different sequencing lanes, but all from the same genotype. I analyzed them following the RNA best practices listed, using the gVCF/HaplotypeCaller workflow (I understand this is unsupported, but it seemed the most appropriate). The end result is a VCF with 27 “columns” for each SNP, one for each sample (for instance root_1, root_2, leaf_1, leaf_2, etc.). I would like to generate a VCF with a single column that combines the information for all the samples. Based on the website descriptions, CombineVariants does not seem appropriate, and I cannot see a way to do it with SelectVariants. It is perhaps complex because, for a given SNP, different samples may have different alleles even though they are from the same genotype, as they come from different individuals. I would prefer to select the most common variant if possible. My downstream goal is to generate a new reference genome for the genotype that all 27 samples are derived from.


  • Sheila, Broad Institute (Member, Broadie, Moderator)


    Can you explain a bit more what you mean by "all from the same genotype"? You may find this thread helpful.

    If you would like to select the most common variant, I think the best/easiest approach is to select the multiallelic sites, use VariantsToTable to extract the AC/AF fields, and then use R to find the allele with the highest AC/AF. First, select the multiallelic sites with SelectVariants using --restrictAllelesTo MULTIALLELIC. That will give you only the sites where there are multiple variant alleles. Then use VariantsToTable with --fields to make a table containing AC/AF. You can then use R or Excel to pick the highest value at each site.

    I hope that helps.
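
The last step of the suggestion above (picking the allele with the highest AC from the exported table) could also be scripted. The following is only a minimal Python sketch, not part of GATK. It assumes the table is tab-delimited with columns named CHROM, POS, REF, ALT and AC, and that multiallelic sites list their ALT alleles and AC values as comma-separated values in matching order. The function name `most_common_alt` and the example rows are hypothetical.

```python
import csv
import io


def most_common_alt(table_text):
    """Return one (CHROM, POS, REF, ALT, AC) tuple per site, keeping the ALT with the highest AC.

    Assumes a tab-delimited table whose ALT and AC columns hold
    comma-separated values in the same order (an assumption about the export).
    """
    picks = []
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        alts = row["ALT"].split(",")
        counts = [int(c) for c in row["AC"].split(",")]
        # Index of the largest allele count; ties go to the first-listed ALT.
        best = max(range(len(alts)), key=lambda i: counts[i])
        picks.append((row["CHROM"], int(row["POS"]), row["REF"], alts[best], counts[best]))
    return picks


# Made-up rows for illustration only; real input would be read from the exported table.
example = (
    "CHROM\tPOS\tREF\tALT\tAC\n"
    "chr1\t1042\tA\tG,T\t12,40\n"
    "chr1\t2310\tC\tT,G\t30,6\n"
)

for pick in most_common_alt(example):
    print(*pick, sep="\t")
```

Note that this compares the ALT alleles only with each other. Deciding whether the chosen ALT is also more common than the reference allele would additionally need the AN field, which this sketch does not use.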


