Extracting consensus variants from a VCF with 27 RNA-seq samples from the same genotype

Is there a tool or recommended best practice for generating a consensus set of variants from multiple samples of the same genotype? In short, I have 27 RNA libraries from different individuals, different tissues, and different sequencing lanes, but all from the same genotype. I analyzed them following the RNA best practices and using the gVCF/HaplotypeCaller workflow (I understand this is unsupported, but it seemed the most appropriate). The end result is a VCF with 27 “columns” for each SNP, one for each sample (for instance root_1, root_2, leaf_1, leaf_2, etc.).

I would like to generate a VCF with a single column, combining the information for all the samples. Based on the website descriptions, CombineVariants does not seem appropriate, and I cannot see a way to do it with SelectVariants. It may be complex because, for a given SNP, different samples, although from the same genotype, may have different alleles, as they are from different individuals. I would prefer to select the most common variant if possible. My downstream goal is to generate a new reference genome for the genotype that all of the 27 samples are derived from.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @rwatson
    Hi,

    Can you explain a bit more what you mean by "all from the same genotype"? You may find this thread helpful.

    If you would like to select the most common variant, I think the best/easiest thing to do is to select the multiallelic sites, use VariantsToTable to get the AC/AF fields, and use R to find the highest AF/AC. First, select the multiallelic sites with SelectVariants using --restrictAllelesTo MULTIALLELIC. That will give you only the sites where there are multiple variant alleles. Then use VariantsToTable with --fields to make a table with AC/AF. You can then use R or Excel to select the highest value.

    I hope that helps.

    -Sheila
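
The last step described above (picking the highest AC per site) could also be scripted instead of done in R or Excel. What follows is only a minimal Python sketch, not part of GATK. It assumes the VariantsToTable output is tab-delimited with at least CHROM, POS, ALT and AC columns, and that a multiallelic site is written as one row with comma-separated ALT and AC values. The function name `most_common_alt` and the example data are hypothetical.

```python
import csv
import io


def most_common_alt(table_text):
    """Map (chrom, pos) to the (alt_allele, allele_count) with the highest AC.

    Assumes a tab-delimited table with CHROM, POS, ALT and AC columns,
    where multiallelic sites list ALT and AC as comma-separated values.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        # Skip rows where the allele count is missing.
        if row["AC"] in ("NA", ".", ""):
            continue
        alts = row["ALT"].split(",")
        counts = [int(c) for c in row["AC"].split(",")]
        # max() keeps the first allele listed when counts are tied.
        alt, ac = max(zip(alts, counts), key=lambda pair: pair[1])
        best[(row["CHROM"], int(row["POS"]))] = (alt, ac)
    return best


# Made-up example: one site with two ALT alleles, C (AC=10) and T (AC=30).
example = "CHROM\tPOS\tREF\tALT\tAC\nchr1\t100\tA\tC,T\t10,30\n"
print(most_common_alt(example))  # {('chr1', 100): ('T', 30)}
```

Note that this only ranks the ALT alleles against each other. If the reference allele might be the most common one at a site, the AN field would also be needed, since the reference count is AN minus the sum of the AC values.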