The current GATK version is 3.7-0

# VQSR for multi-sample VCF

Member Posts: 1

Hi,

I've been going through the VQSR documentation/guide and haven't been able to pin down how it behaves on a multi-sample VCF (generated by multi-sample calling with UG).
Should VQSR be run on this file, or on each sample separately? Coverage and the other statistics used to determine the variant confidence score aren't the same for each sample, so they could lead to conflicting determinations for different samples.

Many thanks.

• Member Posts: 103

Hi, Geraldine:

I have a very similar question. My VQSR step runs on a multi-sample VCF file that was generated by UnifiedGenotyper, calling variants from the pooled BAM files of many samples. I noticed that after the VQSR step, the FILTER column contains either "PASS" or, for some variants, something like "VQSRTrancheINDEL99.00to99.90" or "VQSRTrancheINDEL90.00to99.00". Since the VCF file has multiple samples, I am trying to understand how each individual sample in the group affects the final FILTER assignment.

For example, say the VCF file has 50 samples. At a given variant site, 25 of the samples do well on quality or whatever metrics VQSR assesses, but the other 25 do not (e.g. read quality or alignment issues at this site). If the variant is assigned PASS, that seems reasonable for the 25 good samples but not for the other 25. Conversely, if the site is assigned VQSRTrancheINDEL90.00to99.00 (not PASS), that seems unfair to the good half. Imagine we called variants on the 25 good samples and the 25 bad samples separately, as two groups: the good group would have the variant at this site as PASS, and the bad group would have it flagged as non-PASS. So my question is: when samples are pooled together, how does VQSR decide whether to call a site PASS or non-PASS? Maybe my question is off track here, but I am just trying to understand how VQSR deals with this situation.

Also, I noticed that many variants where one or more samples have a "./." genotype from UnifiedGenotyper tend to be flagged as non-PASS. Is that true? In other words, the sites flagged as PASS seem not to have a ./. genotype in any of the samples in the VCF file.

Thanks a lot!

Mike

Hi Mike,

I think what's confusing you here is that you expect the VariantRecalibrator to judge whether individual sample calls are good or bad, but that's not really the case. Its main purpose is to determine, for a given site, whether there is evidence that the site is really variant in one or more samples. If the site passes the filter, it is then up to you to evaluate whether some of the samples are not really variant at that site.

A genotype of "./." means that the caller could not decide either way whether the sample was variant or not, and so it marked it as a "no-call". This really only happens when there is no useable data for that sample. In general a high degree of missingness is a sign that the variant isn't real (in which case the variant fails the filter, and is not marked as PASS) so the correlation you've noticed is probably real.

Geraldine Van der Auwera, PhD
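
To check the missingness correlation discussed above on your own data, a minimal Python sketch along these lines may help. It is not part of GATK: the function name `no_call_fraction` and the example record are made up for illustration, and it assumes a plain tab-delimited VCF data line with a FORMAT column that contains GT. It reports the site-level FILTER value next to the fraction of samples whose genotype is a no-call.

```python
def no_call_fraction(vcf_line):
    """Return (FILTER value, fraction of samples with a no-call genotype).

    Hypothetical helper for illustration; expects one tab-delimited VCF
    data line (not a header line) with at least one sample column.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    filter_value = fields[6]                 # FILTER is the 7th VCF column
    gt_index = fields[8].split(":").index("GT")
    samples = fields[9:]                     # one column per sample
    missing = 0
    for sample in samples:
        gt = sample.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")
        # A no-call has every allele set to "." (e.g. "./.")
        if all(allele == "." for allele in alleles):
            missing += 1
    return filter_value, missing / len(samples)


# Made-up record: four samples, two of which are no-calls.
example = "\t".join([
    "1", "1000", ".", "A", "AT", "50.0",
    "VQSRTrancheINDEL99.00to99.90", ".", "GT:GQ",
    "0/1:40", "./.", "./.", "0/0:30",
])
print(no_call_fraction(example))
```

Running this over every data line and grouping the fractions by FILTER value would show whether non-PASS sites really do carry more no-calls in a given callset.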

• Member Posts: 65

Could I ask a follow-up question, Geraldine? You said "It is then up to you to evaluate whether some of the samples are not really variant at that site". Could you give some suggestions on how to evaluate this? Should I check GQ? Thanks.
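
For readers who want to try the GQ idea raised in this question, here is a rough Python sketch. It is only one possible approach, not a recommendation from the GATK team: the function name `confident_variant_samples`, the sample names, and the `min_gq=20` cutoff are all assumptions chosen for illustration. For a single site, it lists the samples that carry a non-reference allele and whose GQ meets the cutoff.

```python
def confident_variant_samples(vcf_line, sample_names, min_gq=20):
    """List samples with a non-reference genotype and GQ >= min_gq.

    Hypothetical helper for illustration; min_gq=20 is an arbitrary
    example cutoff, not a GATK recommendation.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")          # e.g. ["GT", "GQ"]
    confident = []
    for name, sample in zip(sample_names, fields[9:]):
        values = dict(zip(fmt_keys, sample.split(":")))
        gt = values.get("GT", "./.")
        gq = values.get("GQ", ".")           # missing when the call is ./.
        alleles = gt.replace("|", "/").split("/")
        # Variant means at least one allele that is neither reference nor missing
        is_variant = any(allele not in ("0", ".") for allele in alleles)
        if is_variant and gq != "." and float(gq) >= min_gq:
            confident.append(name)
    return confident


# Made-up PASS site with five samples.
names = ["s1", "s2", "s3", "s4", "s5"]
record = "\t".join([
    "1", "1000", ".", "A", "G", "500.0", "PASS", ".", "GT:GQ",
    "0/1:45", "0/1:7", "./.", "0/0:60", "1/1:99",
])
print(confident_variant_samples(record, names))
```

In this made-up record, `s2` is called heterozygous but with a low GQ, so it is left out, while `s3` is a no-call and `s4` is a confident homozygous-reference call.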