VQSR for small exome data sets

Hi, I'm working with trios and small-pedigrees (up to six individuals). The VQSR section of the 'best practice' document states that 'in order to achieve the best exome results one needs an exome callset with at least 30 samples', and suggests to add additional samples such as 1000 genomes BAMs.
I'm a little confused about two aspects:
1) The addition of 1000G BAMs is suggested in the VQSR section. If we need the 1000G call sets, would we have to run those BAMs through the HaplotypeCaller or UnifiedGenotyper stages as well? Please forgive the question - I'm not trying to find fault in your perfect document, but please confirm, as it would dramatically increase compute time (though only once). This also overlaps with my next point of confusion:
2) I can understand how increasing the number of individuals from a consistent cohort, or maybe even from very similar experimental platforms, improves the outcome of the VQSR stage. However, the workshop video comments that the variant call properties are highly dependent on individual experiments (design, coverage, technical, etc). So I can't understand how the overall result is improved when I add variant calls from 30 1000G exomes (with their own typical variant quality distributions) to my trio's sample variant calls (also with their own, but very different to the 1000G's, quality distribution).
Hopefully I'm missing an important point somewhere?
Many thanks in advance,
K
Best Answer
Geraldine_VdAuwera Cambridge, MA admin
Hi Klaus,
Don't be afraid of criticizing the Best Practices document -- it is far from perfect and we are currently working on rewriting it more clearly.
Regarding your questions:
1) The 1000G samples should be included in the entire calling process, not just VQSR. This adds quite a bit of overhead of course, but it's necessary for the variant annotations to "fit".
2) The idea is to minimize differences by choosing samples from a similar cohort, e.g. match samples by geographical/ethnic origin and any other property possible. It is also better if they were sequenced using a similar platform (so the error modes will be similar).
Note that this is not a silver bullet approach -- there are all sorts of things that may prevent you from following it, such as being unable to match up your samples to a similar cohort, having used a radically different sequencing tech... and it really only works for exomes (or low-pass whole genomes).
So you may need to find other data than 1000G, or there may not exist any data that you can use to do this. In that case you'll probably have to fall back on the alternative to VQSR, which is hard-filtering.
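To make the first point concrete, here is a minimal sketch of what "including the 1000G samples in the entire calling process" looks like in practice. The file names are placeholders, and the GATK3-style invocation is shown as a comment rather than run; only the list-file construction is executed.

```shell
# Collect the trio BAMs plus the added 1000G BAMs in one .list file;
# GATK accepts such a list as the -I argument (all file names below
# are hypothetical placeholders).
printf '%s\n' trio_father.bam trio_mother.bam trio_child.bam \
    1000G_extra.bam > bams.list

# Hypothetical GATK3-style joint call over all samples (not run here):
# java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
#     -R reference.fasta -I bams.list -o joint_calls.vcf

wc -l < bams.list   # one line per input BAM
```

Calling all samples together this way is what makes the variant annotations consistent across the cohort before VQSR sees them.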
Answers
To follow up on KlausNZ's first question: let's say I add 27 samples from 1000G to my trio, for a total of 30 samples. I will then have 30 input BAM files when I run UnifiedGenotyper. In this case, would I input all 30 BAM files as a list file to get one large VCF output with the variants from all 30 samples?

As I am quite new to GATK, I have noticed that whenever I have a multi-sample VCF output from UnifiedGenotyper, there is only a single INFO field (i.e. the field that has the BaseQRankSum, HaplotypeScore, MQ, etc. values), and I have no idea which sample this INFO field pertains to. In my multi-sample VCF from the 30 samples I mentioned, how do I know which values are for my experimental samples and which are for the 1000G samples? Many thanks.
A follow-up to my follow-up question: assuming that I successfully run VQSR with the additional samples, what value of VQSLOD is recommended as the cutoff for selecting variants with a high likelihood of being true?
Hi @SamSamSam,
The INFO field metrics refer to the site itself and how likely it is that there is variation there in at least one of the samples. For sample-specific metrics, you need to look at the FORMAT field, which gives genotypes and various metrics for each sample.
There is no single recommended cutoff value of VQSLOD; that's the whole point of generating different tranches and looking at the specificity and sensitivity levels obtained against the training sites. Be sure to read all the docs and watch the videos to understand how VQSR works.
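The INFO/FORMAT distinction can be seen directly in a VCF record. Below is a toy two-sample record (made-up values, simplified columns) pulled apart with `cut`: field 8 is the single site-level INFO column, field 9 is the FORMAT key string, and fields 10+ carry one column of sample-specific values per sample.

```shell
# Toy multi-sample VCF record (tab-separated). The single INFO column
# (field 8) annotates the site as a whole; per-sample values are keyed
# by FORMAT (field 9) and repeated once per sample (fields 10, 11, ...).
rec=$'chr1\t10177\t.\tA\tG\t312.5\tPASS\tMQ=60.0;BaseQRankSum=1.21\tGT:DP:GQ\t0/1:28:99\t1/1:31:93'

info=$(printf '%s' "$rec" | cut -f8)      # shared by all samples
sample1=$(printf '%s' "$rec" | cut -f10)  # first sample's GT:DP:GQ
sample2=$(printf '%s' "$rec" | cut -f11)  # second sample's GT:DP:GQ

echo "$info"
echo "$sample1"
echo "$sample2"
```

So MQ and BaseQRankSum above describe the site across the whole cohort, while each sample's own genotype, depth, and genotype quality sit in its FORMAT-keyed column.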
@Geraldine_VdAuwera Hi! I'd like to add a question related to using VQSR on small exome data sets if I may. I have 12 exomes, so I downloaded data for 18 more from 1000G. I made sure to select data with matching ethnicity and sequencing platform, and started from fastq so the processing pipeline would be the same as for my exomes. My question then: if the capture technology for the 1000G exomes and my own data are different, must I then limit the variant call sets to the intersection of the two capture method target intervals for VQSR to work properly?
Much thanks!
-et
Hi @erikt,
This is not something we've tested explicitly. Generally speaking I would say the safest course of action is to restrict the analysis to the intersection, but it is probably okay to use the entire callset. The program will only use a subset of the intersection to train the model anyway, and assuming that the "different" portions are not subject to distinct error modes, the resulting model should be applicable to them as well.
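If you do decide to restrict to the intersection, the step looks something like the sketch below. The coordinates are made up, and a real pipeline would typically use `bedtools intersect` (shown as a comment); the naive all-pairs awk here is just a self-contained stand-in to illustrate what "intersection of the two target interval sets" means.

```shell
# Build the intersection of two capture kits' target BEDs, then restrict
# GATK to it via -L. Interval coordinates below are invented examples.
printf 'chr1\t100\t500\nchr1\t900\t1200\n' > kit_mine.bed
printf 'chr1\t300\t700\nchr1\t1000\t1100\n' > kit_1000G.bed

# Real pipelines would usually do:
#   bedtools intersect -a kit_mine.bed -b kit_1000G.bed > targets.intersect.bed
# Naive stand-in: keep the overlap of every pair of intervals.
awk 'NR==FNR { chrom[NR]=$1; s[NR]=$2; e[NR]=$3; n=NR; next }
     { for (i=1; i<=n; i++) {
         if ($1 != chrom[i]) continue
         lo = ($2 > s[i]) ? $2 : s[i]
         hi = ($3 < e[i]) ? $3 : e[i]
         if (lo < hi) printf "%s\t%d\t%d\n", $1, lo, hi
       } }' kit_mine.bed kit_1000G.bed > targets.intersect.bed

cat targets.intersect.bed
# Then restrict calling/VQSR input, e.g. (hypothetical GATK3 syntax):
#   java -jar GenomeAnalysisTK.jar ... -L targets.intersect.bed
```

For the example above the intersection is chr1:300-500 and chr1:1000-1100, i.e. only the regions both kits actually target.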