How to extract annotation for a single sample after genotyping a combined gVCF file?

I have been running the GATK Best Practices for annotating numerous exome datasets. I created a gVCF for each exome, combined them into a single file, genotyped that file, and now have a combined .vcf file. However, each SNP record carries data for all of the exomes. How can I extract a single exome's worth of annotation from this combined .vcf file before running downstream VQSR filtering?

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @jtshreve
    Hi,

    Do you want to extract INFO annotations or FORMAT annotations? INFO annotations are calculated for all samples, but FORMAT annotations are calculated per-sample. VQSR only uses INFO annotations. Do you only want to use one sample's annotations for VQSR? We don't recommend doing that. The more information from the samples, the better for building a model.

    -Sheila
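
To make the INFO versus FORMAT distinction concrete, here is a minimal Python sketch that splits one VCF record into its site-level INFO annotations and its per-sample FORMAT values. The record, the sample names (`NA12878`, `EXP01`) and all values are made up for illustration, and the parser is a simplification, not how GATK reads VCFs.

```python
# Illustrative two-sample VCF; sample names and values are invented.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tEXP01\n"
    "1\t12345\t.\tA\tG\t812.77\t.\tAC=3;AN=4;QD=25.4;MQ=60.0\tGT:DP:GQ\t1/1:18:54\t0/1:22:99\n"
)


def parse_record(header_line, record_line):
    """Return (info, per_sample) for one VCF data line."""
    columns = header_line.lstrip("#").split("\t")
    fields = record_line.split("\t")
    sample_names = columns[9:]  # sample columns start after FORMAT

    # INFO: one set of key=value annotations for the whole site.
    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flag entries have no value

    # FORMAT: the keys are shared, the values differ per sample.
    format_keys = fields[8].split(":")
    per_sample = {
        name: dict(zip(format_keys, values.split(":")))
        for name, values in zip(sample_names, fields[9:])
    }
    return info, per_sample


lines = [l for l in VCF_TEXT.splitlines() if not l.startswith("##")]
info, per_sample = parse_record(lines[0], lines[1])
print("site-level INFO:", info)
print("per-sample FORMAT:", per_sample)
```

In this toy record, `QD` and `MQ` appear once for the site, while `GT`, `DP` and `GQ` appear once per sample column.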

  • Hi Sheila,

    I would like to leverage the entire group's data, so I will use INFO for VQSR. However, after VQSR filtering, how will I extract just the annotation for a particular exome? For example, I have 51 exomes: 50 are from the 1000 Genomes Project and 1 is experimental. I have completed all steps before VQSR (I have created 51 .gvcf files, combined them, and genotyped them). Now I would like to use VQSR filtering and determine the high-confidence SNPs for my 1 experimental exome. Can you please help me understand what I will need to do?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @jtshreve I think what we're confused about here is what you mean by "extract just the annotation for a particular exome". Do you mean you want to generate a separate VCF for each exome sample?

    If so, what you need to do is this: run VQSR on the multisample VCF you got from GenotypeGVCFs, then subset down to individual samples. You can do this using SelectVariants (subsetting by e.g. sample name, with -sn) or VariantsToTable if you prefer a table format rather than a VCF.
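
As a rough sketch of that subsetting step, the Python snippet below only assembles a SelectVariants command line; it does not run GATK. It assumes GATK4-style syntax (`gatk SelectVariants ... -O`); with GATK 3 the equivalent call is `java -jar GenomeAnalysisTK.jar -T SelectVariants ... -o`. The file names and the sample name `EXP01` are hypothetical, and the two optional flags are additions for illustration: they drop sites where the selected sample carries no variant allele and sites that failed filtering.

```python
import shlex


def select_sample_command(reference, input_vcf, sample, output_vcf,
                          exclude_non_variants=True, exclude_filtered=True):
    """Build (not run) a GATK4-style SelectVariants command for one sample."""
    cmd = [
        "gatk", "SelectVariants",
        "-R", reference,    # reference FASTA
        "-V", input_vcf,    # multisample VCF after VQSR
        "-sn", sample,      # keep only this sample's column
        "-O", output_vcf,
    ]
    if exclude_non_variants:
        # Drop sites that are no longer variant once other samples are removed.
        cmd.append("--exclude-non-variants")
    if exclude_filtered:
        # Drop sites whose FILTER field is not PASS (e.g. failed VQSR).
        cmd.append("--exclude-filtered")
    return cmd


# Hypothetical file and sample names.
cmd = select_sample_command(
    "reference.fasta", "total.recalibrated.vcf", "EXP01", "EXP01.vcf"
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `gatk` is installed.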

  • Hi Geraldine,

    I guess I'm confused about a particular point of this workflow. Please indicate which of the following examples is correct:

    1) I am using my 51 exomes to create 51 GVCF files, which are then combined and genotyped into a single "total.vcf" file. Later, I will run VQSR using my 1 experimental exome's .vcf (not gvcf) as input and "total.vcf" as a training resource. This will leverage the 51-exome combination as a training set, and I will get a highly filtered set of SNPs for my experimental exome as output.

    2) I am using my 51 exomes to create 51 GVCF files, which are then combined and genotyped into a single "total.vcf" file. Later, I will run VQSR using this "total.vcf" as input and the training resources listed in the documentation. This will leverage both the 51-exome combination and the resource training sets, and I will get a highly filtered set of SNPs as output. I will then need to run SelectVariants with my 1 experimental exome's sample name to extract just those high-quality SNPs that pertain to my experimental exome.

    I'm not having any computational difficulty accomplishing these steps, just conceptual. Thanks again for your assistance.

    Jacob
