Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

VQSR for small exome data sets

KlausNZKlausNZ Posts: 28Member

Hi, I'm working with trios and small-pedigrees (up to six individuals). The VQSR section of the 'best practice' document states that 'in order to achieve the best exome results one needs an exome callset with at least 30 samples', and suggests to add additional samples such as 1000 genomes BAMs.
I' a little confused about two aspects:
1) the addition of 1000G BAMs being suggested in the VQSR section. If we need the 1000G call sets, we'd have to run these through the HaplotypeCaller or UnifiedGenotyper stages? Please forgive the question - I'm not trying to find fault in your perfect document, but please confirm as it would dramatically increase compute time (though only once), and overlaps with my next point of confusion:
2) I can understand how increasing the number of individuals from a consistent cohort, or maybe even from very similar experimental platforms, improves the outcome of the VQSR stage. However, the workshop video comments that the variant call properties are highly dependent on individual experiments (design, coverage, technical, etc). So I can't understand how the overall result is improved when I add variant calls from 30 1000G exomes (with their own typical variant quality distributions) to my trio's sample variant calls (also with their own, but very different to the 1000G's, quality distribution).

Hopefully I'm missing an important point somewhere? Many thanks in advance, K

Best Answer


  • SamSamSamSamSamSam HoustonPosts: 2Member

    To follow-up on KlausNZ's first question, let's say I add 27 samples from 1000g to my trio (to get a total of 30 samples). So I will have 30 input bam files when I run UnifiedGenotyper. In this case would I have to input all the 30 bam files as a list file to get one large vcf output file with the variants from all 30 samples? As I am quite new to the GATK, I have noticed that whenever I have a multisample vcf file output from UnifiedGenotyper then there is only a single "INFO" field (i.e. the field that has the BaseQRankSum, HaplotypeScore, MQ, etc values) . But I have no idea which sample this INFO field pertains to? In my multisample vcf file from the 30 samples I mentioned, how do I know which values are for my experimental sample and which are for the 1000g samples? Many thanks.

  • SamSamSamSamSamSam HoustonPosts: 2Member

    Follow-up to my follow-up question.....assuming that I successfully run the VQSR with additional samples, what value of the VQSLOD is recommended as the cutoff for selecting variants with a high likelihood of being true?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hi @SamSamSam,

    The INFO field metrics refer to the site itself and how likely it is that there is variation there in at least one of the samples. For sample-specific metrics, you need to look at the FORMAT field, which gives genotypes and various metrics for each sample.

    There is no single recommended cutoff value of VQSLOD; that's the whole point of generating different tranches and looking at the specificity and sensitivity levels obtained against the training sites. Be sure to read all the docs and watch the videos to understand how VQSR works.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.