Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

VQSR for multi-sample VCF

eranmickeranmick Posts: 1Member

Hi,

I've been going through the VQSR documentation/guide and haven't been able to pin down an answer to how it behaves on multi-sample VCF (generated by multi-sample calling with UG).
Should VQSR be run on this? Or on each sample separately, given that coverage and other statistics used to determine the variant confidence score aren't the same for each sample and so can lead to conflicting determinations on different samples.

What is the best way to go about this?

Many thanks.

Tagged:

Best Answer

Answers

  • mikemike Posts: 103Member

    Hi, Geraldine:

    I got almost similar questions. my VQSR step is on a multi-sample vcf file, which was generated from Unified genotype by calling variants from pooled bam files of many samples. I noticed that for the vcf file after VQSR step, the FILTER column has either "PASS". or something like " VQSRTrancheINDEL99.00to99.90" or "VQSRTrancheINDEL90.00to99.00" for some variants, but this vcf file has multiple samples, I am trying to understand how each individual sample amongst the group affect the final assignment of such "Filter". In other words, let's see, the vcf file has 50 samples, for a given variant site, if 25 of the samples doing great in quality or whatever metrics that VQSR assessed, but 25 of the other samples not doing so great on this site (e.g. quality issues of reads or alignment here), if assigned PASS to this variant, the 25 good samples would be reasonable, but for the other 25 samples seems not reasonable. If vice versa, assigned VQSRTrancheINDEL90.00to99.00 (not PASS) to this variant site, for the half good samples, it seems not fair. Imagine if we just take the 25 good samples together, and just the bad 25 samples as a group to call variants separately as 2 groups, the 25-good group would have the variant at this site (PASS) and the 25 bad sample group would be flagged as not PASS. So my question is when pooled samples together, how VQSR made decision to call the site as PASS or non-PASS? Maybe my question is out of track here, but I just try to understand how VQSR deal with such situation. Also I noticed many variant with one or more samples as "./." as genotype by Unified genotyper seem tending to be flagged as Non-PASS, is it true? In other words, the ones flagged as PASS seem not having any ./. genotype in any of the samples in vcf file.

    Thanks a lot!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    Hi Mike,

    I think what's confusing you here is that you expect the VariantRecalibrator to judge whether individual sample calls are good or bad, but that's not really the case. Its main purpose is to determine, for a given site, whether there is evidence that the site is really variant in one or more samples. If the site passes the filter, it is then up to you to evaluate whether some of the samples are not really variant at that site.

    A genotype of "./." means that the caller could not decide either way whether the sample was variant or not, and so it marked it as a "no-call". This really only happens when there is no useable data for that sample. In general a high degree of missingness is a sign that the variant isn't real (in which case the variant fails the filter, and is not marked as PASS) so the correlation you've noticed is probably real.

    Geraldine Van der Auwera, PhD

  • ying_sheng_1ying_sheng_1 Posts: 29Member

    Could I ask an following-up question, Geraldine? You said “It is then up to you to evaluate whether some of the samples are not really variant at that site”, could you give some suggestions how to evaluate, check GQ? Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    To be honest, evaluation is the hardest part, because there is no one-size-fits-all formula. GQ should give you a strong indication. But sometimes you might need to actually look at the pileup of bases to see what the data looks like. Hopefully, the higher your data quality is, the less you will need to go in and check things manually.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    I should say also, people in the community are most welcome to share their favorite tips and tricks evaluating variants and genotype calls. This is the part where we typically hand the data over to analysts who do the actual interpretation of the call set data, so we are not in the best position to give advice. But it's an important topic and we are more than happy to help facilitate the conversation.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.