Extracting consensus variants from a VCF with 27 RNA-seq samples from the same genotype

Is there a tool or recommended best practice for generating a consensus set of variants from multiple samples of the same genotype? In short, I have 27 RNA libraries from different individuals, different tissues, and different sequencing lanes, but all from the same genotype. I analyzed them following the listed RNA-seq Best Practices and using HaplotypeCaller in GVCF mode (I understand this is unsupported, but it seemed the most appropriate). The end result is a VCF with 27 "columns" for each SNP, one for each sample (for instance root_1, root_2, leaf_1, leaf_2, etc.).

I would like to generate a VCF with a single column that combines the information from all the samples. Based on the website descriptions, CombineVariants does not seem appropriate, and I cannot see a way to do it with SelectVariants. It may be complex because, for a given SNP, different samples may carry different alleles even though they share a genotype, since they come from different individuals. I would prefer to select the most common variant if possible. My downstream goal is to generate a new reference genome for the genotype that all 27 samples are derived from.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @rwatson
    Hi,

    Can you explain a bit more what you mean by "all from the same genotype"? You may find this thread helpful.

    If you would like to select the most common variant, I think the best/easiest thing to do is to select the multiallelic sites, use VariantsToTable to get the AC/AF fields, and use R to find the highest AC/AF. First, select the multiallelic sites using SelectVariants with --restrictAllelesTo MULTIALLELIC. That will give you only the sites where there are multiple variant alleles. Then use VariantsToTable with --fields to make a table with AC/AF. You can then use R or Excel to pick the allele with the highest value at each site.

    I hope that helps.

    -Sheila
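
The last step of the answer above (picking the allele with the highest AC from the VariantsToTable output) could be sketched in Python instead of R or Excel. This is only an illustration under assumptions that are not stated in the thread: the table is tab-delimited with the columns CHROM, POS, REF, ALT and AC, and multiallelic sites list their ALT and AC values comma-separated in the same order. The function name most_common_alt and the example rows are hypothetical.

```python
import csv
import io


def most_common_alt(table_text):
    """Return (CHROM, POS, REF, ALT, AC) per site, keeping the ALT allele with the highest AC.

    Assumes a tab-delimited table with a header row, and comma-separated
    ALT/AC values at multiallelic sites (an assumption about the table layout).
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    picks = []
    for row in reader:
        alts = row["ALT"].split(",")
        counts = [int(c) for c in row["AC"].split(",")]
        if len(alts) != len(counts):
            raise ValueError(f"ALT/AC length mismatch at {row['CHROM']}:{row['POS']}")
        # Index of the largest allele count; ties go to the first allele listed.
        best = max(range(len(alts)), key=lambda i: counts[i])
        picks.append((row["CHROM"], int(row["POS"]), row["REF"], alts[best], counts[best]))
    return picks


# Made-up rows for illustration only, not real data.
example = (
    "CHROM\tPOS\tREF\tALT\tAC\n"
    "chr1\t100\tA\tC,T\t4,30\n"
    "chr1\t250\tG\tA,C,T\t12,3,1\n"
)

for pick in most_common_alt(example):
    print(*pick, sep="\t")
```

Note that AC counts only the alternate alleles, so this sketch ranks ALT alleles against each other and does not check whether the reference allele is more common than all of them; that comparison would also need the AN field.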
