VQSR for small exome data sets

KlausNZ Posts: 28 Member

Hi, I'm working with trios and small pedigrees (up to six individuals). The VQSR section of the 'best practice' document states that 'in order to achieve the best exome results one needs an exome callset with at least 30 samples', and suggests adding additional samples such as 1000 Genomes BAMs.
I'm a little confused about two aspects:
1) The suggestion to add 1000G BAMs appears in the VQSR section. If we need the 1000G call sets, does that mean we have to run those BAMs through the HaplotypeCaller or UnifiedGenotyper stages as well? Please forgive the question - I'm not trying to find fault in your perfect document - but please confirm, as this would dramatically increase compute time (though only once). It also overlaps with my next point of confusion:
2) I can understand how increasing the number of individuals from a consistent cohort, or maybe even from very similar experimental platforms, improves the outcome of the VQSR stage. However, the workshop video notes that variant call properties are highly dependent on the individual experiment (design, coverage, technical details, etc.). So I don't understand how the overall result is improved when I add variant calls from 30 1000G exomes (with their own typical variant quality distributions) to my trio's variant calls (which also have their own quality distribution, quite different from that of the 1000G data).

Hopefully I'm just missing an important point somewhere. Many thanks in advance, K

Answers

  • SamSamSam Houston Posts: 2 Member

    To follow up on KlausNZ's first question: let's say I add 27 samples from 1000G to my trio, to get a total of 30 samples. I will then have 30 input BAM files when I run UnifiedGenotyper. In this case, would I input all 30 BAM files as a list file to get one large VCF output with the variants from all 30 samples? As I am quite new to GATK, I have noticed that whenever I have a multisample VCF output from UnifiedGenotyper, there is only a single INFO field (i.e. the field that has the BaseQRankSum, HaplotypeScore, MQ, etc. values), but I have no idea which sample this INFO field pertains to. In my multisample VCF from the 30 samples I mentioned, how do I know which values are for my experimental samples and which are for the 1000G samples? Many thanks.
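
    For reference, the kind of joint call I have in mind looks roughly like this (classic GATK command-line syntax; bams.list would hold the 30 BAM paths one per line, and ref.fasta, dbsnp.vcf and exome_targets.bed are placeholders for my own files):

        java -jar GenomeAnalysisTK.jar \
            -T UnifiedGenotyper \
            -R ref.fasta \
            -I bams.list \
            --dbsnp dbsnp.vcf \
            -L exome_targets.bed \
            -o joint_30samples.vcf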

  • SamSamSam Houston Posts: 2 Member

    Follow-up to my follow-up question... assuming that I successfully run VQSR with the additional samples, what value of VQSLOD is recommended as the cutoff for selecting variants with a high likelihood of being true?

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Hi @SamSamSam,

    The INFO field metrics refer to the site itself and how likely it is that there is variation there in at least one of the samples. For sample-specific metrics, you need to look at the FORMAT field, which gives genotypes and various metrics for each sample.
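
    For example, a single record from a multisample VCF looks schematically like this (two samples shown, values invented purely for illustration):

        #CHROM  POS     ID  REF  ALT  QUAL    FILTER  INFO                             FORMAT       my_sample        NA12878
        1       861276  .   A    G    733.27  .       AC=1;AF=0.25;AN=4;DP=55;MQ=59.6  GT:AD:DP:GQ  0/1:14,11:25:99  0/0:30,0:30:90

    The INFO column (AC, AF, DP, MQ and so on) summarizes the site across all samples, while the per-sample columns at the end, decoded by the FORMAT key, hold each individual sample's genotype and metrics.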

    There is no single recommended cutoff value of VQSLOD; that's the whole point of generating different tranches and looking at the specificity and sensitivity levels obtained against the training sites. Be sure to read all the docs and watch the videos to understand how VQSR works.
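
    In other words, rather than thresholding on a raw VQSLOD value, you choose a truth sensitivity tranche when applying the recalibration. A rough sketch of that step (classic GATK syntax; file names are placeholders, and the 99.0 level is only a common starting point, not a universal recommendation):

        java -jar GenomeAnalysisTK.jar \
            -T ApplyRecalibration \
            -R ref.fasta \
            -input joint_30samples.vcf \
            -recalFile output.recal \
            -tranchesFile output.tranches \
            --ts_filter_level 99.0 \
            -mode SNP \
            -o joint_30samples.recalibrated.vcf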

    Geraldine Van der Auwera, PhD

  • erikt Posts: 7 Member

    @Geraldine_VdAuwera Hi! I'd like to add a question related to using VQSR on small exome data sets, if I may. I have 12 exomes, so I downloaded data for 18 more from 1000G. I made sure to select data with matching ethnicity and sequencing platform, and started from fastq so that the processing pipeline would be the same as for my own exomes. My question, then: if the capture technologies for the 1000G exomes and my own data are different, must I limit the variant call sets to the intersection of the two capture methods' target intervals for VQSR to work properly?

    Much thanks!

    -et

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer

    Hi @erikt,

    This is not something we've tested explicitly. Generally speaking I would say the safest course of action is to restrict the analysis to the intersection, but it is probably okay to use the entire callset. The program will only use a subset of the intersection to train the model anyway, and assuming that the "different" portions are not subject to distinct error modes, the resulting model should be applicable to them as well.
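
    If you do decide to restrict to the intersection, one simple way to do it (file names here are placeholders) is to intersect the two capture kits' target BED files, for example with bedtools, and then pass the result to the calling and recalibration steps via -L:

        bedtools intersect \
            -a my_capture_targets.bed \
            -b 1000g_capture_targets.bed \
            > shared_targets.bed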

    Geraldine Van der Auwera, PhD
