The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Variant Quality Score Recalibration (VQSR)

Posts: 122GATK Developer mod
edited June 2014

This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.

### Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator contrastively evaluates variants in a two step process, each performed by a distinct tool:

• VariantRecalibrator
Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.

• ApplyRecalibration
Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

#### How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.

#### How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.

## Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

### Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (from the bottom, with lowest values) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels.

### Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

• The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
• The truth sensitivity (TS) approach is conceptual cleaner than deciding on a novel Ti/Tv target for your dataset
• The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

### Finally, a couple of Frequently Asked Questions

#### - Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

#### - Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well.

Post edited by Geraldine_VdAuwera on
Tagged:

• Broad InstitutePosts: 684Member, Administrator, GATK Developer, Broadie, Moderator, DSDE Member, GP Member admin

@mprasad in the past when users have reported such low Ti/Tv ratios for exomes it was because they forgot to restrict the calling to just the exome target regions (+/- some small padding). If you are calling variants outside these regions they will mostly be false positives.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

@vsvinti, see @ebanks' comment to @mprasad above. I think this might also apply to you; it would explain why your titv ratio is lower than expected. Did you restrict your analysis to target intervals using the -L argument?

Geraldine Van der Auwera, PhD

• Epilepsy Genomics CenterPosts: 7Member

Hi Geraldine,
How do I check to make sure they are using the same target regions? In the 1000 Genomes' Data directory, they have bam files for exome alignments for each sample, also listing the machine, aligner and ethnic group, but no reference to a BED file or specific enrichment kit. Or is it sufficient that I use exome sequencing data (rather than whole genome)? You mentioned above that the -L interval option should be used at each step of preprocessing and variant calling/recalibration. The BAM files that we've received have already gone through the preprocessing steps in your best practices. Should I go back redo the preprocessing on not just our samples, but the 1000 Genomes as well to make sure that they all are processed in the same way before attempting to recalibrate using the two groups?

Chris

Hi Chris,

I think the 1000G are all done with the same exome targets, so that should be in the description of the experimental design. If your data is already generated (ie you are post-prep) there's not much to be done; just check whether your data used the same target list, and if not, generate a new list from the intersection.

The only pre-processing step that really benefits from using the target list (beyond runtime) is BQSR, so you might want to redo just that on your data. 1000G data is processed according to our Best Practices so no real need to reprocess

Geraldine Van der Auwera, PhD

• Posts: 4Member

@Geraldine_VdAuwera Could I ask, is VQSR meant to be run on a pooled vcf? Ie - for example if i have 20 vcf (since Haplotype was run individually on each sample [20 cancer cell line] ), should I ;

1) Combine the 20 vcf into one and run VQSR
OR
2) Run VQSR separately on each individual vcf?

If possible, could you also please comment if not sure if I should run Hapotype individually or pooled since these 20 cancer cell lines are from the same cancer type but different individuals...

Many thanks

@Zaki, VQSR is designed to be run on multisample VCFs of samples that were called together. So you should run HaplotypeCaller on all your samples together then run VQSR on the resulting VCF file.

FYI "pooling" refers to something different than combining sample calls in a single file.

Geraldine Van der Auwera, PhD

• Posts: 4Member

@Geraldine_VdAuwera Thank you very much for your kind clarification.

Could I extend my question to get an important concept corrected...

I am slightly confused how would one decide what samples to combine in order to run Haplotype Caller, Is there a rule of thumb on how samples should be combined?

In other words - If one were to have 10 tumour sample (same tumour type) & 3 normal (non-matched normal). Would it be recommended to run Haplotype caller in two batch; First on the tumour , Second on the normals? Or just run Haplotype caller on all 13 samples...

Also, perhaps due to my inexperience in statistics in general as well as the statistics behind Haplotype caller.. But i fail to understand the difference if Haloptype is run on single sample v.s if it is run on a combined sample.. http://www.broadinstitute.org/gatk/events/3391/GATKw1310-BP-5-Variant_calling.pdf (mainly slide #12) helped, but a layman explanation would always be helpful.

P/s - When would the big 3.0 schedule to be released? As I am doing this all for RNA-seq

• Posts: 47Member

I have used the -L argument all throughout the pipeline to restrict to target intervals. However, in this particular analysis, the data came from different capture platforms - and I used the union of those intervals. This might explain it. Ti/Tv was indeed a bit higher when running on samples captured with one platform only.

@Geraldine_VdAuwera said:
vsvinti, see ebanks' comment to mprasad above. I think this might also apply to you; it would explain why your titv ratio is lower than expected. Did you restrict your analysis to target intervals using the -L argument?

I would recommend calling all your samples together (tumor + normals). The advantages of multisample calling are described here. In addition to these advantages, when you have samples that were called together, you have more data aggregated for the VQSR step (which can't be run on samples that were called separately).

Geraldine Van der Auwera, PhD

That makes sense then. Note that in that situation we typically would take the intersection of the intervals rather than the union.

Geraldine Van der Auwera, PhD

• FrancePosts: 13Member

Thanks for the suggestion @ebanks . I did include an interval list. However, I realized it was not the correct one- it was a larger interval than the enrichment kit that was used. That likely explains the low Ti/Tv ratio.
However, I have since repeated the analysis with the correct interval list on several other samples and each time I end up with a low Ti/Tv ratio for novel variants (known Ti/Tv is closer to expected between 2.5 and 2.7). I think this is because most of my exomes were enriched using newer versions of the Agilent SureSelect kit than that used for 1000G (this I believe is v2 from the 1000G website), hence the targeted regions are larger for my exomes than for the 1000G exomes (50 Mb for my samples vs ~45 Mb for the 1000G exomes). I am using HaplotypeCaller and VQSR with 29 BAM files from 1000G as I don't have enough exomes for VQSR).
One option would be to use an intersection of the 2 interval lists like Garaldine has pointed out; however, won't I then be missing out on calling variants in all the extra enriched regions of my exomes where I expect to find real variants? If I continue using the larger interval list in order to identify new variants in the extra regions of my exomes, will the fact that there aren't many variants in these regions in the 1000G exomes cause VQSR to attribute a low score to these variants? In such a case should I revert back to hard filtering or rather try and use arguments such as --numBad and --maxGaussians using only my sample exomes?
I' much appreciate any thoughts/advice.
Thanks again for all your help,

Megana

• FrancePosts: 13Member

Just an update to my post above- I tried the running VQSR on my samples only and things are looking even funnier!
I extracted the variants sites from my 3 samples from the vcf file that I generated using HC and the 29 other 1000G bam files and used the resulting vcf file to run VQSR with the --maxGaussians 4 argument. Although there seems to be a slight improvement in the Ti/Tv ratio, the tranche plot looks really unusual. Any ideas would be welcome :-)

Thanks,
Megana

That looks like the tranches data are not being processed/plotted in the right order. Can you post your command line?

Geraldine Van der Auwera, PhD

• FrancePosts: 13Member

Yes, of course; here it is :

java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T VariantRecalibrator -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta -input Famb.rawvariants.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/dbsnp_138.b37.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile Famb.VQSR.SNP.recal -tranchesFile Famb.VQSR.SNP.tranches -rscriptFile Famb.VQSR.SNP.plots.R --maxGaussians 4

Alright, that looks reasonable. Although -- are you saying you're running VQSR only on your samples, excluding the 100G ones you called them with? If so, I'm not sure that's why your results look weird, but it's definitely not recommended. The point of adding the other samples for calling is to have their variant evidence as well for VQSR, so you should recalibrate all the called variants together. Can you try that and let me know whether your results look more sane?

Geraldine Van der Auwera, PhD

• FrancePosts: 13Member

Hi Geraldine,

Thanks for your response. You are right, I ran VQSR on my samples only to see if I could improve the ti/tv ratio corresponding to the tranches in the tranche plot. I think this may be the problem as you say as running VQSR with the entire call set (my samples + 1000 genomes) gives a reasonable tranche plot (please see attached), albeit with a low ti/tv and large number of false positives, which made me wonder if there is something wrong with my parameters.

I think the low ti/tv is in large part due to differences in the enrichment kit used for my samples and the 1000G samples (I used the interval list corresponding to the union of the enrichment kit used on my sample and 1000G). I think that most of the false positives are coming from the areas that are called but not actually enriched in the 1000G samples. (I am currently testing this idea by calling variants with the intersected interval list on the same samples, with which I expect to have a more reasonable ti/tv).

After fretting a lot about whether I should instead use an intersected interval file for my pipeline and spending time watching the workshop videos on VQSR (which are awesome btw!) to better understand it, I came to the conclusion that I should just continue using the union interval list since I am interested in rare medically relevant mutations in my samples. Since the 1000G samples are just "fillers", the false positives in these samples and the resulting low ti/tv shouldn't bother me so much as the VQSLOD scores and tranche cut-off should still be valid. Does this sound like a reasonable strategy to you? I just want to make sure I am not way off-base here.

Thanks again so much for the wonderful job you guys are doing!

Megana

Ah, good to hear! Those plots look much better.

If your enrichment intervals are a subset of the intervals used in 1000G I would definitely take the intersection rather than the union. If there are regions in your samples that aren't in the 1000G intervals, add those back in. But I would get rid of whatever intervals are 1000G-only; they're probably harmless, but why risk it (and waste the processing time and disk space) since they're no use to you anyway.

Geraldine Van der Auwera, PhD

• Cambridge, MAPosts: 9Member

Hi, Geraldine

Do you happen to know which tranche is used to make the filtering plots (e.g., the most stringent, or the least stringent)? I'm asking because I wasn't sure if most of the filtered/retained variants pictured in the plots are filtered or retained on the basis of their tranche, or whether they are such extreme outliers that they are filtered regardless of which tranche is specified.

Thanks!

Grace

• Posts: 7Member

Hi @Geraldine_VdAuwera et al.,

I've been using VQSR on a couple few thousand exomes (uniformly processed) that were called with Unified Genotyper supplying the target list via -L. These samples are of mixed ethniciteis, but are mostly Caucasian. I've tried a few different parameter configurations, basing them off of @rpoplin 's best practices document and the VQSR video, which as others have mentioned is a great resource! Nonetheless, I'm trying to wrap my head around interpreting the some of the distribution plots.

1. When running with multiple -an arguments, is it reasonable to expect all plots to display clear clustering? Are there perhaps some -an arguments that tend not to perform as well with clustering compared to some of the others? For example, the first two plots in the attached pdf are not terrible (not particularly good either I don't think), but the last two do not show great clustering. I'm sure there isn't any hard and fast rule, but are there any guidelines for determining, given the data, the cleanest set of calls VQSR will be able to retain?

2. Let's say for example that all of the plots that have QD on their axis look rather poor. Now, presumably one way we could obtain more visually appealing plots is to not use -an QD. However, is this really an appropriate course of action for this situation (off-the-cuff I'd think no)? Besides minNumBad, are there any other parameters I should tinker with? Perhaps even some parameters with Unified Genotyper?

Any thoughts you, the GATK team, or other users have would be very much appreciated!

Fwiw, I am seeing lower ti/tv ratios like some of the others here. My known variant ti/tv ratios are all around 3.2 as expected, but my novel ranges from ~1.25 – 1.8 (100.0 and 90.0 tranches respectively).

Here are the arguments I used. Due to the issues I've read about regarding VQSR in GATK 2.7 I've been using 2.5.2.

Program Args: -T VariantRecalibrator -R /lustre/beagle/pittjj/reference/GATK_Bundle_2.5/37_partial/human_g1k_v37.fasta -input ../reduced_SNP_merged_sort.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /lustre/beagle/pittjj/reference/GATK_Bundle_2.5/37_partial/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /lustre/beagle/pittjj/reference/GATK_Bundle_2.5/37_partial/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /lustre/beagle/pittjj/reference/GATK_Bundle_2.5/37_partial/1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /lustre/beagle/pittjj/reference/GATK_Bundle_2.5/37_partial/dbsnp_137.b37.vcf -an DP -an QD -an FS -an MQRankSum -an InbreedingCoeff -an ReadPosRankSum -an HaplotypeScore --target_titv 3.2 -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile reduced_SNP_merged_sort_SNP.recal -tranchesFile reduced_SNP_merged_sort_SNP.tranches -rscriptFile reduced_SNP_merged_sort_SNP_plots.R

Hi @jpitt,

I wouldn't recommend just dropping QD to make the plots look prettier; that's just ignoring the potential problems. Especially since QD is usually a pretty solid metric. You could try plotting the distribution of each annotation individually to see if anything looks funky in that respect. It may be useful to break things down per sample to see if maybe you have a subpopulation of samples that are problematic and are messing up the overall properties.

Geraldine Van der Auwera, PhD

Hi @gtiao,

As I recall the "retained"variants are those in the least stringent tranches, and the rest are shown as "filtered" to varying degrees indicated by the color intensity, corresponding to the other tranches.

Geraldine Van der Auwera, PhD

• Posts: 47Member

Hi @rpoplin

Wondering if I should still be waiting for a response on my posts from Feb 13th?

@vsvinti said:
rpoplin , I have uploaded vsvinti_test5VQSR.logfile.log to the ftp server. I will wait a couple of weeks for response. Thanks.

• Posts: 122GATK Developer mod

It sounds like different capture targets are being used for your single exome sample and the 30 exome samples from 1000 Genomes that were spiked in. This could definitely cause a problem when trying to interpret the ti/tv ratio of the results. What happens if you were to run with just the 30 samples that are all using the same target list? This will let us decouple if it is a problem with the algorithm or with the data.

Cheers,

• Posts: 47Member

@rpoplin, was your message for me?

You had wanted to look at the VQSR plots and the log files that came with the VQSR 2.8 run -- which is why I had uploaded them. I had used it for the results on UG calling on ~ 2,000 samples.

The ti/tv was a separate issue, I think caused by calling on the union of capture platforms.

• Posts: 122GATK Developer mod

I think I've lost track of what the actual problem is then. The VQSR plots and logs look perfectly reasonable to me. I had thought we were trying to track down why the ti/tv is lower than expected with your one exome sample. What is the problem we are trying to solve? Perhaps it makes more sense to create a thread for this issue in the "ask the team" section of the forum instead of here in the VQSR documentation page.

Cheers,

• Posts: 47Member

@‌rpoplin

I had several issues with VQSR not working on my data. I downgraded to v2.5, which worked. After release of 2.8, I tried it and asked Geraldine's opinion on whether 2.8 was better than 2.5. I only posted one page of the plots.

We sort of agreed that 2.8 looks a bit better, but she asked for your opinion. You wanted to have a look at the rest of the plots, and the logs, to see what's happening with my data. If you also think my data looks good - then we have no more issues.

• Colombo Sri LankaPosts: 11Member

Hey guys,
I'm trying to use variant recalibrator to my raw vcf, which i have generated using the Haplotype Caller. Input training sets were downloaded from the GATK resource bundle. But it throws an error saying,

##### ERROR MESSAGE: Bad input: Values for MQRankSum annotation not detected for ANY training variant in the input callset.

Any suggestions?

Hi @Nilaksha,

Check that you're using the right resources that match your reference build. If you're using e.g. an hg19 resource file against a b37 callset, the contig names are different, so the recalibrator won't find any overlapping sites. Unfortunately the error is misleading right now; we'll fix that in the next version.

Geraldine Van der Auwera, PhD

• DEPosts: 23Member

Hello,

I've got a question about my post-VQSR filtered variants and interpreting the tranches PDF.

I'm working with whole-genome sequence data for an inbred mouse strain at about 10x coverage. HaplotypeCaller calls about 6mio variants compared to the mm10 reference. Running a Ti/Tv analysis on the full set gives me a ratio of 1.99 (using vcftools) - so just about inside the "expected" range, given my data.

Of these, ~4mio are already known (and have been annotated with rsIDs). The VQSR tranches plot, however, suggests that that the Ti/Tv ratio is significantly lower for the ~2mio novel variants (see tranches plot, attached; I've also attached the VQSR plot for SNPs).

My question is: should I be concerned about this low Ti/Tv ratio for the novel variants? Thanks!

Hi @mfletcher, this would tend to suggest that there may be a high rate of false positives in your novel variants. It's not a sure thing, but you may need to do some additional QC of your novel variants.

Geraldine Van der Auwera, PhD

• DEPosts: 23Member

Hi @Geraldine_VdAuwera‌ , thanks for the quick reply! I'll take a closer look at the novel variants and see if there's anything systemically dodgy about them. Cheers!

• Neuroscience Research AustraliaPosts: 2Member

Hi,

I would like to hear any thoughts on the tranche plot attached. This is on 18 samples of Illumina exome sequencing data, using the most recent best practices, with haplotype caller and the -L option limited to enriched regions (truseq). The Ti/Tv seem to be doing what is expected, and the R plots look fine, but the FP/TP boxes in the most sensitive tranche don't seem right. Any comments would be greatly appreciated.

Thanks,

Alex

Hi Alex,

This looks fine. It could possibly be a minor bug in the plotting (the FP/TP boxes in the most sensitive tranche do look a little weird), but it's not anything I'd worry about.

-Sheila

• Neuroscience Research AustraliaPosts: 2Member

Thanks for the response

• Posts: 16Member

"Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box."

This link appears to be disabled. Is there any updated link?

Hmm, this was due to a Dropbox security policy. I'm posting a new link now, let me know if you have any issues.

Geraldine Van der Auwera, PhD

• Posts: 16Member

That worked. Thanks!

• Posts: 20Member
edited June 2014

I was using R_3.x and got errors based on some ggplot2 deprecated functions ('opts' is now 'theme' etc). Plots can be generated by running:

sed 's/opts/theme/g' r_script.out.R | sed 's/theme_rect/element_rect/g' | sed 's/theme_line/element_line/g' | sed 's/+ theme(title=model PDF) //g' > r_script.out.R1;
mv r_script.out.R1 r_script.out.R;
R CMD BATCH r_script.out.R

Obviously (if you read R=) the 'title' is removed, but it makes no odds to the PDF.
Post edited by bruce01 on

Hi,

I have just fixed the script, and I will make the new version available later today.

-Sheila

• IndiaPosts: 29Member

Hi there!!

I have used variantrecalibration step in my project but it made just 2 plots in the pdf file- tranches and the ROC curve!! Can you tell me what is missing in my command?? I have installed latest R version and have also added Rscript path to environment variables.

The command is as follows:
java -Xmx4g -jar /tools/GenomeAnalysisTK-1.3-16-g6a5d5e7/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19.fa -input SRR_realign.recal.snp.vcf -recalFile SRR_new.output.recal -tranchesFile SRR_new.output.tranches --maxGaussians 4 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf -an QD -an HaplotypeScore -an FS -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP -rscriptFile SRR_new_recal.plots.R

• IndiaPosts: 29Member

Also, I am not sure whether my tranches and TP vs FT rate plots are proper!!
Please guide me here!

• IndiaPosts: 29Member

I have used exome data!!

• IndiaPosts: 29Member

A pdf file is generated which should contain the pdf and scatter plots but the file is empty!! I dont kno what might be the issue!! does anybody know about this? please help me out here!

Hi,

Did you specify a filename for the recalibration plots?

The tranche plots look fine.

-Sheila

• Posts: 15Member

We have capture sequencing data. Our capture region is 270kbp and sample size is N=128. We tried to perform VQSR using VariantRecalibrator for SNPs and received the following error message:

##### ERROR MESSAGE: No data found.

Could you please let me know the likely cause? And what is the best solution?

Thank you for your help.

• IndiaPosts: 29Member

Hi Sheila!
No I did not, It created a pdf file with SRR_new_recal.plots.R.pdf (-rscript file name which I gave) with no graphs in it along with the tranches pdf. Is'nt -rscript file name enough or I will have to specify file name for recalibration plots separately?

Thank you!

Hi,

You need to specify the filename for the recalibration plots separately.

-Sheila

Hi,

Unfortunately, the capture region is too small. Even with many samples, it is simply impossible for the model to get enough variants to work with.

-Sheila

• IndiaPosts: 29Member

Hi sheila!

Could you tell me how to do that, what is the command for that?
--plotfile or what exactly?

Thanx

• ParisPosts: 25Member

Hello,

just a silly question.
I'm not working with human, but as far as I understand, your different resources contains redundant sites: most of the sites presents in your high quality resource "HapMap" are also recorded in the low quality dbsnp resource, am I right ?

So what happen for the classification of variants as known vs novel ?
For instances, if one given site belongs to HapMap, in your setup, it is tagged as unknown. But if this site also appears in dbsnp, then it is considered as "known". How does VariantRecalibrator manage this conflict ?

Thanks for your help,
Fabrice

• ParisPosts: 25Member

Hi,
I have a problem with my tranche plots: sensitivity tranches are messed up. My problem looks like what mprasad posted last 26th february, but I can't get where the problem lies.

I'm working with a non-human organism. I got two "resource" sets as follows:
-1 very short list (810 sites only) of known snp used and verified in a previous study (here called HighConfidencedbSNP.vcf)

-1 that I generated from the selection of very likely SNPs from my sequencing data, based on rationale from the design of my experiement (here called HK104_BackgroundSNPs.vcf)

Here is my command line:

java -Xmx4g -jar \$GATKDir/GenomeAnalysisTK.jar -R refgenome.fa -T VariantRecalibrator --input MA296vs211vsHK104_20X_HM-DP.vcf -resource:HighConfdbSNP,known=true,training=true,truth=true,prior=15.0 HighConfidencedbSNP.vcf -resource:backgroundSNP,known=false,training=true,truth=false,prior=10.0 HK104_BackgroundSNPs.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP --maxGaussians 4 -mode SNP -recalFile vqsr_snp2.recal -tranchesFile vqsr_snp2.tranches -rscriptFile vqsr_snp2.plots.R

And attached are my tranches files with the truth tranches are not properly ranked (100% threshold being before 99.9% threshold )

Any idea to fix this issue ? May thanks

• IndiaPosts: 29Member

Because I am not able to find any such option!! Also, I am using Gatk version 1.3!!

@bioinfo_89, you should not use such an old version. It is completely unsupported. You should upgrade to a newer version, otherwise we cannot help you.

Geraldine Van der Auwera, PhD

• IndiaPosts: 29Member

Hi Geraldine!!
Ok! I would try 2.8.1 Gatk version then!

• IndiaPosts: 29Member

I tried the same command uisng 2.8.1 version but I have still no PDF and scatter plots in the pdf file. When I click on the pdf file it says that the document contains no pages!!!

Please guide me through this issue!!

@bioinfo_89‌ Please tell me these things:

1. What command are you trying to run?

2. What is the output of the command? Is there an error message or does the program run to completion?

3. If the program runs to completion, what is the problem?

Geraldine Van der Auwera, PhD

• IndiaPosts: 29Member

Ok!

I ran Baserecalibrator with the following command:

java -Xmx2g -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fa -I SRR1024173_RG_index.marked.realigned.fixed.bam -knownSites dbsnp_137.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -o recal_data.table
This gave a recalibration table.

Then I ran printReads:
java -Xmx2g -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T PrintReads -R hg19.fa -I SRR1024173_RG_index.marked.realigned.fixed.bam -BQSR recal_data.table -o SRR1024173_RG_index.marked.realigned.fixed.recal.bam

I again ran BaseRecalibrator:
java -Xmx2g -jar /GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fa -I SRR1024173_RG_index.marked.realigned.fixed.bam -knownSites dbsnp_137.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -BQSR recal_data.table -o after_recal_data.table

The I used AnalyzeCovariates where I am getting error:

java -Xmx2g -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T AnalyzeCovariates -R hg19.fa -before -after after_recal_data.table -plots recal_plots.pdf

ERROR:

##### ERROR --------------------------------------------------------------------

Kindly let me know where did I go wrong in the analysis steps or the commands.

Well, first, that is a Base Recalibration problem, and this is a discussion about Variant Recalibration. But the underlying issue is the same so I'll give you a break.

See this article for instructions on how to solve the problem: http://www.broadinstitute.org/gatk/guide/article?id=4294

Geraldine Van der Auwera, PhD

• IndiaPosts: 29Member

Ok!! Well Thnk u !!

@FabriceBesnard Sorry I forgot to answer you -- it looks like your issue might just be a problem in the script with sorting the figure labels because the TiTv is the same for those two tranches. I don't think you need to worry about it.

Geraldine Van der Auwera, PhD

• Hong KongPosts: 4Member

Hi,

Can I please ask about the meaning of the "False Discovery Rates" (FDR) that specify the boundaries in the tranches? My understanding of FDR is that it is the proportion of True negatives amongst those that are declared positive, as e.g. in http://en.wikipedia.org/wiki/False_discovery_rate. However, this would suggest that even if we use the most stringent Tranche (say we give 90.0 as the lowest Tranche boundary), 90% of the novel variants are false positives. This seems a bit incredible, and is also inconsistent with the Tranche plots that I tend to get, which usually show that the proportion of false positives is minimal in that Tranche.

In part of the documentation, these rates appear to be defined in terms of sensitivities, which seems to make more sense, supposing the meaning of "sensitivity" is "proportion of declared positive among True positives". However, I am generally more interested in knowing the FDRs than the sensitivity, and it seems that the only way to get this information is to visually gauge it from the plots, which are also unavailable for indels. But perhaps I am misunderstanding/missing something here.

Thanks for your help.

Tim

The tranche boundaries refer to the sensitivity relative to a known dataset. For example, if you take the 90.0 tranche boundary, you're effectively setting the VQSLOD filtering threshold so that you know that if you were to apply it to the truth dataset, you would retrieve the top 90% of the variants in that dataset. If you increase the boundary, say to 99.0, then you would retrieve the top 99 % of the variants in that dataset. The FDR then depends on your confidence in the truth dataset.

Geraldine Van der Auwera, PhD

• Hong KongPosts: 4Member

Hi Geraldine,

Thanks for the answer, but I'm still unsure how you can calculate the FDR. For simplicity let us assume that we have 100% confidence in the truth dataset, i.e. that all the variants in the truth dataset are true variants, although not all variants outside the truth dataset are false calls. The FDR is the proportion of false calls among those that are declared PASS, and let's say that PASS is declared for those variants which have a sensitivity of 90% or less. Given this piece of information, surely it's insufficient for us to know the FDR. What other pieces of information am I missing and can these be obtained from GATK?

Tim

• United KingdomPosts: 254Member ✭✭

@tshmak you would need some high confidence sites to determine your FDR. Perhaps you have some high confidence calls from high coverage data? Otherwise you could call NA12878 along with your other samples and use this dataset:
http://genomeinabottle.org/blog-entry/availability-phase-consistent-gold-standard-variant-set-na12878-and-rtgtools-software

• Posts: 13Member

Hye Geraldine.

I am doing VQSR for human whole genome sequencing data. But I keep got an error for this step.

I used the command like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19/ucsc.hg19.fasta -input path_to_file/file.bam.vcf -resource:hapmap, known+false, training=true, truth=true, prior=15.0 /hg19/hapmap_3.3.hg19.vcf -resource:omni, known=false, training=true, truth=true, prior=12.0 /srv/disk01/hg19/1000G_omni2.5.hg19.vcf -resource:1000G, known=false, training=true, truth=true, prior=10.0 /hg19/1000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp, known=true, training=false, truth=false, prior=2.0 /hg19/dbSNP138/dbsnp_138.hg19.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP -mode SNP -recalfile output_file_snp.recal -tranchesFile outut_file_snp.tranches -rscriptFile output_file_snp.plots.R

The errors that were appeared like:

##### ERROR Invalid argument value '/srv/disk01/Reference/hg19/dbSNP138/dbsnp_138.hg19.vcf' at position 29.

I also have re-run the command for couple of times and still get the same error. Could you please make me clear what were the error all about?

Many thanks in advance!

• Posts: 13Member

Hye Geraldine.

I am doing VQSR for human whole genome sequencing data. But I keep got an error for this step.

I used the command like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19/ucsc.hg19.fasta -input path_to_file/file.bam.vcf -resource:hapmap, known+false, training=true, truth=true, prior=15.0 /hg19/hapmap_3.3.hg19.vcf -resource:omni, known=false, training=true, truth=true, prior=12.0 /srv/disk01/hg19/1000G_omni2.5.hg19.vcf -resource:1000G, known=false, training=true, truth=true, prior=10.0 /hg19/1000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp, known=true, training=false, truth=false, prior=2.0 /hg19/dbSNP138/dbsnp_138.hg19.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP -mode SNP -recalfile output_file_snp.recal -tranchesFile outut_file_snp.tranches -rscriptFile output_file_snp.plots.R

The errors that were appeared like:

##### ERROR Invalid argument value '/srv/disk01/Reference/hg19/dbSNP138/dbsnp_138.hg19.vcf' at position 29.

I also have re-run the command for couple of times and still get the same error. Could you please make me clear what were the error all about?

Many thanks in advance!

• Posts: 439Member, GSA Collaborator ✭✭✭✭

@Azlin‌ -

The problem is that you added spaces after the commas in the -resource parameter. Those spaces cause each of the clauses to be recognized as a separate argument, not part of the -resource string.

Also, you have a typo ("known+false") in your hapmap parameter

• Posts: 13Member

Hi @pdexheimer

Thank you. Thank you. Thank you very much!!!
I don't even notice that. Oh my..

Thank you more ^_^

• Posts: 3Member

We try to compare the variant quality from different sources. We notice that both VQSLOD and QUAL scores can be used to access the variant quality. What is the difference of VQSLOD and QUAL? Which one is better for this purpose?

Thank you for your help.
Qiang

@QiangHu The QUAL score is the score that is produce for the variant in isolation by the variant caller. The VQSLOD is a score that is computed after variant recalibration, which involves applying machine learning to detect covariation between the truth status of training variants and the values of their annotations, based on a large number of variants. So VQSLOD is a better predictor of variant quality, especially when it is interpreted relative to cutoff scores computed based on highly validated truth data. See the documentation on variant recalibration for more details about this process.

Geraldine Van der Auwera, PhD

• SpainPosts: 25Member

Hi,

I have been reading posts on this section and others trying to find a solution to my problem,without much luck. Thing is that I am analyzing about 20 human exome data. I have been processing with GATK v 3.2 released in July
I pre-processed samples capturing just exome target regions (SeqCape_EZ_Exome_v2) using the -L command +- ip 100 bp. Then I've called variants and checked its quality using: (1) UG+VQSR and (2) HC+GenotypeGVCFs+VQSR, following your Best Practices recommendations. I have come up with two issues.
1- I obtain very low ti/tv ratio, about 1.9 (when for exome data around 2.9 should be expected, right?)
2 - tranche plots obtained from the UG or the HC pipeline are very different ( I attach both outputs for comparison)

I am aware that UG is not recommended for small datasets (like my 20 samples), however, in my opinion, UG is giving me better results than HC. any hints on this behaviour? and suggestions on how to improve tranche plots in HC and overall ti/tv ratio results?

Victoria

• SpainPosts: 25Member

I was wondering whetehr you or any of your team had the time to look at my previous post?

Victoria

Hi Victoria, (@vihefe)

I'm sorry, your question slipped off our radar.

The TiTv calculations made for the plots are relative to the expected TiTv used internally for plotting, which for exomes you should set using the --target_titv argument (otherwise the value for whole genomes is used by default).

It looks like the plots are actually essentially the same but for an anomaly that causes the 90 tranche to have a TiTv that is out of order and therefore it is plotted in the wrong place, messing up the plotting of the next tranche as well. I'm not sure why that is happening but it could be because the actual TiTv of that tranche is higher than the target value, which would lead to some buggy behavior. If so, setting the target titv should fix this issue.

Geraldine Van der Auwera, PhD

• luxembourgPosts: 12Member

Hi,
I have analyzed exome seq data using GATK best practices, However I am getting really low ti/tv ratios. I have used the target ti tv and -L arguement. So, should I use the hard filtering approach or is there any other way to improve my ti/tv ratio. Please let me know, attached are the pdf files generated. Thanks in advance.

Hi @dheeraj‌,

The recalibration plots look fairly normal at first glance but I must admit that the tranches plots look spectacularly bad. Can you please list all the command lines you ran as part of this analysis? Is this human data or something else?

Geraldine Van der Auwera, PhD

• luxembourgPosts: 12Member

This is a human data. Below are the command lines.

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37_Ensembl_MT_66.fasta -input recalibrated.VCF -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf.gz -an InbreedingCoeff -an MQ -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP --target_titv 3.2 -L Exome_interval.list -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP_selected.recal -tranchesFile recalibrate_SNP_selected.tranches -rscriptFile recalibrate_SNP_plots_selected.R -nt 12

Ok, this all looks normal. I'm just curious about one thing -- your input vcf file is called "recalibrated.VCF". Does that mean those variants have already been recalibrated previously? Or do these come straight from the caller? How did you generate this callset?

Geraldine Van der Auwera, PhD

• luxembourgPosts: 12Member

The input file was generated from GATK genotypeGVCFs. I think some of the samples were really bad, maybe they have effected the ti/tv ratio ???? I am not sure. Could you please suggest me any alternative filtering??. Thanks in advance.

• SpainPosts: 25Member

The TiTv calculations made for the plots are relative to the expected TiTv used internally for plotting, which for exomes you should set using the --target_titv argument (otherwise the value for whole genomes is used by default).

It looks like the plots are actually essentially the same but for an anomaly that causes the 90 tranche to have a TiTv that is out of order and therefore it is plotted in the wrong place, messing up the plotting of the next tranche as well. I'm not sure why that is happening but it could be because the actual TiTv of that tranche is higher than the target value, which would lead to some buggy behavior. If so, setting the target titv should fix this issue.

I followed your suggestions on running -VariantRecalibration with -titv 3.2 (as suppossed for WES) but I did not see any improvement in tranche plots. they looked essentially the same. However, my major concer is the substantially low ti/tv values ~1.6 (as obtained witha larger dataset. The new attached pdf files correspond to 400 WES samples ran with -GenotypeGVCFs + VariantRecalibrator, being ran the second file with -titv option - which in my opinion gives worst results.
Nonetheless, my major concer is the high number of False Positives ... should I apply stronger or extra filters? that would be much appreciated.

Thanks

V

It's really hard to say, @vihefe. I would recommend doing further analysis to find out if the amount of known/novel variants and the titv ratio is the same across all your samples, or if it is different in some subsets. Maybe you have some failed samples in there that are messing up the overall stats.

Geraldine Van der Auwera, PhD

• Hong KongPosts: 1Member

Hi!
We are currently running GATK for variant calling on one human exome sample downloaded from NCBI. It is suggested that variant recalibration works best with at least 30 samples but we only have one sample. Should we run VQSR as normal or add additional samples like CEUTrio from resource bundle? In either case, should we adopt GVCF output in HC? Additionally, when interval list is needed, what is the difference between -L 20 and using interval.list file downloaded from resource bundle?
Thanks
Christine

Hi Christine,

Please have a look at this article (under Important Notes for Exome Capture Projects) https://www.broadinstitute.org/gatk/guide/article?id=1259

You can get the bam files from the 1000Genomes Project here: http://www.1000genomes.org/data

After you get the bam files, the best thing to do is run the GATK pipeline on those bams, and when you get to the variant calling stage, run joint genotyping on the 1000Genomes bam files as well as your sample bam file. https://www.broadinstitute.org/gatk/guide/article?id=3893

The -L argument is for inputting the intervals you would like to perform analysis on. -L 20 means only perform analysis on chromosome 20. For your exome data, you will want to input the intervals that you are interested in.

I hope this helps.

-Sheila

• United KingdomPosts: 254Member ✭✭

Can I put called INDELs into one of two categories as 0 and 1 (or 1 and 2) and use this annotation for VQSR? Or does it only work properly for annotations with continuous values? I will attempt to figure it out myself. I'm just asking in case anyone has some prior experience with this.

Unfortunately the VQSR algorithm assumes that annotation values follow a gaussian distribution, so my guess is it won't be very happy with a binary classification. We have a reimplementation prototype that uses random forests that would be much better for what you want to do, but it's in its infancy and that project is currently on hold due to other pressing priorities. (so yeah, that was a useless comment, sorry)

Geraldine Van der Auwera, PhD

• Tarrytown NYPosts: 1Member

When running VQSR on a pVCF what does the PASS/Tranche value output actually represent. Is this based on the quality of the variant called in ALL samples at that position?

Thanks.


• United KingdomPosts: 254Member ✭✭

@Geraldine_VdAuwera said:
Unfortunately the VQSR algorithm assumes that annotation values follow a gaussian distribution, so my guess is it won't be very happy with a binary classification. We have a reimplementation prototype that uses random forests that would be much better for what you want to do, but it's in its infancy and that project is currently on hold due to other pressing priorities. (so yeah, that was a useless comment, sorry)

On the contrary! Thanks for clarifying regarding binary classification and Gaussian distributions! I know one can do binary classifications with support vector machines, but I tried doing SVM in the past with the scikit-learn module for Python and it performed poorly compared to VQSR. It seems scikit-learn has a random forest classifier, which I will try out, if I can somehow find the time:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
`

Not exactly, because we wouldn't want to filter out variants that are real in some samples just because other samples in the cohort are uncovered or are invariant. The annotations used by VQSR are calculated to reflect the amount and quality of evidence for variation present in the subset of samples that have a non-hom-ref genotype.

Remember that the VQSR model learns what are the possible annotation profiles of good variants based on the truth set, and then scores all the variants based on how their annotation profiles relate to the possible "good" profiles. The result is the VQSLOD annotation, which is the log-odds ratio of the variant being real. It's a relative scale that is not meant to be interpreted absolutely.

In relation to that, the tranche cutoffs are values of VQSLOD that correspond to specific levels of sensitivity to the truth set. So the PASS/Tranche status of a variant simply describes which slice of data the variant belongs to, relative to that scale of sensitivity.

Geraldine Van der Auwera, PhD

@tommycarstensen We had an intern this summer who got the scikit-learn package to work for a prototype RF-VQSR implementation, if I recall correctly. So that would be a good avenue to explore. Let me see if we can share some of that work.

Geraldine Van der Auwera, PhD

• jhuPosts: 1Member

@Geraldine_VdAuwera I am wondering if you can give some update on the implementation of RF-VQSR prototype. We are interested to see how that works. Will appreciate if you can share some of that work.

• United KingdomPosts: 254Member ✭✭

@peng said:
Geraldine_VdAuwera I am wondering if you can give some update on the implementation of RF-VQSR prototype. We are interested to see how that works. Will appreciate if you can share some of that work.

Uhhh, I would like to see some slides as well @peng you can try it out yourself in the meantime here.