How to extract annotation for a single sample after genotyping a combined gVCF file?

I have been running the GATK Best Practices workflow on numerous exome datasets. I created a gVCF for each exome, combined them into a single file, genotyped that file, and now have a combined .vcf file. However, each SNP record carries data for many exomes. How can I extract a single exome's worth of annotation from this new .vcf file before running downstream VQSR filtering?


Best Answer


  • Sheila, Broad Institute (Member, Broadie, admin)


    Do you want to extract INFO annotations or FORMAT annotations? INFO annotations are calculated per site across all samples, whereas FORMAT annotations are calculated per sample. VQSR only uses INFO annotations. Do you want to use only one sample's annotations for VQSR? We don't recommend doing that. The more information the model gets from the samples, the better it can be built.
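To make the INFO versus FORMAT distinction concrete, here is a minimal Python sketch that splits one VCF record into its site-level INFO annotations and its per-sample FORMAT annotations. The header, the record, the sample names (`NA12878`, `EXP01`) and all values are made up for illustration; the helper `split_annotations` is hypothetical and is not part of GATK.

```python
# A hypothetical two-sample VCF record (tab-separated); all values are invented.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tEXP01"
RECORD = "1\t12345\t.\tA\tG\t812.4\tPASS\tAC=1;AN=4;QD=14.2;MQ=60.0\tGT:DP:GQ\t0/0:31:90\t0/1:27:99"


def split_annotations(header_line, record_line):
    """Return (site-level INFO dict, per-sample FORMAT dicts) for one VCF record."""
    columns = header_line.lstrip("#").split("\t")
    fields = record_line.split("\t")
    row = dict(zip(columns[:9], fields[:9]))  # the nine fixed VCF columns

    # INFO: one set of key=value pairs shared by every sample at this site.
    info = {}
    for entry in row["INFO"].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True  # flag-type keys carry no value

    # FORMAT: the key list is shared, but each sample column has its own values.
    format_keys = row["FORMAT"].split(":")
    per_sample = {
        sample: dict(zip(format_keys, values.split(":")))
        for sample, values in zip(columns[9:], fields[9:])
    }
    return info, per_sample


info, per_sample = split_annotations(HEADER, RECORD)
print(info)                 # one INFO dict for the whole site
print(per_sample["EXP01"])  # FORMAT values for a single sample
```

In this sketch there is a single `QD` value for the site no matter how many samples are present, while `GT`, `DP` and `GQ` exist once per sample. That is why subsetting to one sample changes which FORMAT columns you see but does not give you "per-sample" INFO values.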


  • jtshreve (Member)

    Hi Sheila,

    I would like to leverage the entire group's data, so I will use INFO for VQSR. However, after VQSR filtering, how do I extract just the annotation for a particular exome? For example, I have 51 exomes: 50 are from the 1000 Genomes Project and 1 is experimental. I have completed all steps before VQSR (I created 51 .gvcf files, combined them, and genotyped them). Now I would like to run VQSR filtering and determine the high-confidence SNPs for my 1 experimental exome. Can you please help me understand what I need to do?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    @jtshreve I think what we're confused about here is what you mean by "extract just the annotation for a particular exome". Do you mean you want to generate a separate VCF for each exome sample?

    If so, what you need to do is this: run VQSR on the multisample VCF you got from GenotypeGVCFs, then subset down to individual samples. You can do this using SelectVariants (subsetting by, e.g., sample name with -sn), or VariantsToTable if you prefer a table format rather than a VCF.
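The subsetting step described above could be sketched as follows. This Python snippet only builds and prints a SelectVariants command; it assumes a GATK4-style command line (`gatk SelectVariants ... -O`), whereas older GATK3 releases used a different invocation, so check `--help` for your version. Only `-sn` comes from the reply above. The file names, the sample name `EXP01`, the helper `select_sample_command`, and the choice to add `--exclude-non-variants` and `--exclude-filtered` are assumptions made for illustration.

```python
import shlex


def select_sample_command(input_vcf, sample_name, output_vcf):
    """Build a GATK4-style SelectVariants command that keeps a single sample."""
    return [
        "gatk", "SelectVariants",
        "-V", input_vcf,           # multisample VCF, after VQSR has been applied
        "-sn", sample_name,        # keep only this sample's genotype column
        "--exclude-non-variants",  # assumption: drop sites where this sample is not variant
        "--exclude-filtered",      # assumption: drop records that did not pass filtering
        "-O", output_vcf,
    ]


# Hypothetical file and sample names.
cmd = select_sample_command("total.recalibrated.vcf", "EXP01", "EXP01.vcf")
print(shlex.join(cmd))
# To actually run it (requires GATK on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Printing the command rather than executing it keeps the sketch runnable without GATK installed. Whether to drop non-variant and filtered sites depends on what you plan to do downstream, so treat those two flags as optional.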

  • jtshreve (Member)

    Hi Geraldine,

    I guess I'm confused about a particular point of this workflow. Please indicate which of the following examples is correct:

    1) I am using my 51 exomes to create 51 gVCF files, which are then combined and genotyped into a single "total.vcf" file. Later, I will run VQSR using my 1 experimental exome's .vcf (not gVCF) as input and "total.vcf" as a training resource. This will leverage the 51-exome combination as a training set, and I will get a highly filtered set of SNPs for my experimental exome as output.

    2) I am using my 51 exomes to create 51 gVCF files, which are then combined and genotyped into a single "total.vcf" file. Later, I will run VQSR using this "total.vcf" as input and the training resources listed in the documentation. This will leverage both the 51-exome combination and the resource training sets, and I will get a highly filtered set of SNPs as output. I will then need to run SelectVariants with my 1 experimental exome's sample name to extract just those high-quality SNPs that pertain to my experimental exome.

    I'm not having any computational difficulty carrying out these steps; my difficulty is conceptual. Thanks again for your assistance.

