
Variant Quality Score Recalibration (VQSR)

rpoplin Posts: 122 GATK Developer mod
edited June 2 in Methods and Workflows

This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.

Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
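Concretely, as a rough sketch of what that score represents (the exact computation depends on the GATK version and on how the model of artifacts is built), for a variant with a given set of annotation values:

    VQSLOD = log [ P(annotations | model of true variants) / P(annotations | model of artifacts) ]

so positive values mean the call looks more like the training (true) variants than like artifacts, and negative values mean the opposite.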

The variant recalibration process evaluates variants in two steps, each performed by a distinct tool:

  • VariantRecalibrator
    Create a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set, and then evaluate all input variants against that model. This step produces a recalibration file.

  • ApplyRecalibration
    Apply the model parameters to each variant in the input VCF files, producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step filters the calls based on this new LOD score, writing a value to the FILTER field of variants that don't meet the specified LOD threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.
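For orientation, here is a minimal sketch of what a VariantRecalibrator command line can look like for SNPs. The file names are placeholders, and the resources, priors, and annotations are simply copied from the examples quoted later on this page; check the current Best Practices and FAQ for up-to-date values before using them.

    java -Xmx4g -jar GenomeAnalysisTK.jar \
       -T VariantRecalibrator \
       -R reference.fasta \
       -input raw_variants.vcf \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
       -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
       -mode SNP \
       -recalFile output.recal \
       -tranchesFile output.tranches \
       -rscriptFile output.plots.R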

How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program determines the VQSLOD value above which 99.9% of the variants in the training callset are included. It then uses that VQSLOD value as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.
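A minimal sketch of the corresponding ApplyRecalibration step, assuming the recalibration and tranches files produced by the sketch above (the 99.9 threshold is just the example value used in this paragraph, and the file names are placeholders):

    java -Xmx4g -jar GenomeAnalysisTK.jar \
       -T ApplyRecalibration \
       -R reference.fasta \
       -input raw_variants.vcf \
       -recalFile output.recal \
       -tranchesFile output.tranches \
       --ts_filter_level 99.9 \
       -mode SNP \
       -o recalibrated_variants.vcf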

Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run; the report is named after the -rscriptFile output, e.g. path/to/output.plots.R.pdf. For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

image

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown above. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, they move into the red region of the model's PDF) and are filtered out. This makes sense: higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes, and higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts, and putative variants here should be filtered out!

Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (from the bottom, with the lowest values) is exceedingly specific but less sensitive, and each subsequent tranche introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can use the tranches to select more specific or more sensitive call sets in a principled way, or they can incorporate the recalibrated quality scores directly, weighting individual variant calls by their probability of being real rather than analyzing a fixed subset of calls. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

image

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels.

Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
  • The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features (such as being close to indels, or actually being MNPs) and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.
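For reference, the tranche boundaries themselves are simply requested on the VariantRecalibrator command line; the values below are the ones used in the examples further down this page, and you can ask for whatever set of sensitivities makes sense for your project:

    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0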

Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

- Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command-line version of R that gets installed right alongside the main R installation. We also make use of the ggplot2 library, so please be sure to install that package as well.
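As a quick sanity check (a sketch assuming a Unix-like shell and a standard R installation), you can verify that Rscript is on your PATH and install ggplot2 from within R:

    which Rscript                              # should print the path to the Rscript executable
    Rscript -e 'install.packages("ggplot2")'   # installs the plotting library the reports depend on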


Comments

  • lbernalberna Posts: 5Member
    edited November 2012

    Hi Geraldine, I am working with yeast and I am doing the VariantRecalibrator step. As I don't have a truth data set, I want to "filter" my initial round of raw SNPs in order to keep the highest quality score SNPs, as you say. I was wondering if you have any suggestion about the filtration parameters...

    I am working with each strain as a different organism, so I have good coverage (80X) but only one lane.

    I tried with:

    java -Xmx4g -jar GenomeAnalysisTK.jar  -R S288c.fasta -T VariantFiltration  --variant $1.raw.vcf  --filterExpression "QD<2.0 || MQ<45.0 || FS>60 || MQEankSum< -12.5 || ReadPosRankSum<-8.0 "  --filterName "hardtovalidate"   -o $1.filt.vcf
    

    to remove the LowQual and hardtovalidate SNPs afterwards. Does that make sense? Thanks for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi there,

    The exact values can depend a lot on your data set. You'll need to experiment a little and see what works best for you. Two ways to check whether you are getting good variants are to look them up in a genome viewer like IGV, and to call variants with a different program to see what's called in both.

    Good luck!

    PS: you have a typo in your command line, I think you mean MQRankSum rather than MQEankSum.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    In addition, you should have a look at the hard-filtering parameters given for this purpose in the Best Practices document.

    http://www.broadinstitute.org/gatk/guide/article?id=1186

    Geraldine Van der Auwera, PhD

  • FrancoisGuillaumeFrancoisGuillaume Jouy en josasPosts: 4Member

    Hi, I tried the command indicated in the FAQ "Does the VQSR work with non-human variant calls?"

    --B:concordantSet,VCF,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf

    But the argument doesn't seem to be recognized (at least with gatk-2-2.5). If this is a typo or a deprecated option, could it be corrected? Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    That specific command syntax is deprecated. The values are still applicable, you just need to pass them using the new syntax. Examples of the new syntax are given here:

    http://www.broadinstitute.org/gatk/guide/article?id=1259

    We are periodically reviewing and updating the entire documentation; this will be corrected when we come to it.

    Geraldine Van der Auwera, PhD

  • TechnicalVaultTechnicalVault Sanger, Cambridge, UKPosts: 69Member
    edited November 2012

    Just to clarify: will I have problems with VQSR if the number of input samples (4 WES), rather than the number of sites, is small?


    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    It really depends on the total number of sites you have. You may be okay if your samples have a lot of variation. The best way to know is to try and see if the recalibrator complains...

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member
    edited November 2012

    Hi:

    Just a quick question. For VariantRecalibrator, the command is described as below on the documentation:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
       -T VariantRecalibrator \
       -R reference/human_g1k_v37.fasta \
       -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
    ......
    

    I used an older version of GATK; for the last option (-resource:dbsnp), the prior used to be 8.0, but now it is prior=6.0 in the new version. Is this intentional or just a typo? Or is it due to the change of dbSNP version from 132 to 135?

    Thanks a lot

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    I'll check this with @rpoplin, who knows this stuff better than anyone, but I'm pretty sure it's intentional and linked to the dbsnp version.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thank you very much! Look forward to that!!

    I also have a related question about VQSR. For the new version, the above command is mainly for the SNP VQSR step; I did see some description of indel VQSR, but not many details or example commands like the ones I pasted above. So my question is: for indel VQSR, what training data should I use, and more importantly, what prior values should I use for each of them? My indel callset has about 4,000 variants, which looks like a workable number for the VQSR step based on the VQSR documentation, right?

    Thanks again!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    OK, I checked and that is indeed the case -- the "8" prior is what was used for the tutorial given on the VQSR page, which uses dbsnp version 132, while the updated Best Practices use dbsnp version 135 with prior=6. FYI, the hierarchy of documents is that the Best Practices is always the most up-to-date, so when in doubt, use whatever is given there. Next are FAQs, which we do update as needed. Unfortunately we don't have the time to update all tutorials and examples (such as you find in the VQSR article), so those may not use the latest values and command lines. That is why it is important to check the Best Practices.

    For indel resource recommendations, see the second-to-last paragraph in this doc:

    http://www.broadinstitute.org/gatk/guide/article?id=1247

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Dear Geraldine:

    Thanks so much for the info and detailed answers, Appreciated very much!

    Just want to confirm my last question on indel using VQSR: based on that doc from your link above

    -resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf

    is the only resource I need for the VQSR step? Is it equivalent to the hapmap resource used for the SNP VQSR? The reason I am asking is that for SNPs there are two additional resources, as below:

    -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \

    with known, training and truth set at different levels using Omni and dbSNP, whereas for indels we only need one setting (known=false,training=true,truth=true) using the gold standard indels. Is my understanding correct?

    Thanks again!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Mike,

    That's correct, for indels the Mills set is the only resource for which we have a good evaluation of quality, and so it's the only one we use for VQSR.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member
    edited November 2012

    Thanks again, Geraldine!

    I just finished the indel VQSR and the tranches plot is kind of odd although Gaussian mixture model plots seem not so bad. So I have to come back to ask your opinion on this.

    Here is the actual command I used:

    java -Xmx4g -jar /Path/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /Path/hg19.fa \
    -input /Path/GATK_UG_SelIndel.vcf \
    -resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 /Path/bundle-1.5/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf \
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
    --maxGaussians 4 \
    -recalFile /Path/GATK_UG_SelIndel.VarRecal \
    -tranchesFile /Path/GATK_UG_SelIndel.tranches \
    -rscriptFile /Path/GATK_UG_SelIndel.R \
    -mode INDEL
    

    I also attached the pdf file and image for the tranches plot as below

    image

    Here is my tranches file:

    # Variant quality score tranches file
    # Version number 5
    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,0,2997,0.0000,0.0000,0.3232,VQSRTrancheINDEL0.00to90.00,INDEL,2724,2451,0.8998
    99.00,0,4033,0.0000,0.0000,-0.7785,VQSRTrancheINDEL90.00to99.00,INDEL,2724,2696,0.9897
    99.90,0,4517,0.0000,0.0000,-10.5409,VQSRTrancheINDEL99.00to99.90,INDEL,2724,2721,0.9989
    100.00,0,4539,0.0000,0.0000,-52.4947,VQSRTrancheINDEL99.90to100.00,INDEL,2724,2724,1.0000
    

    My question would be: why does the tranches plot look like there are no TPs at all, when at 0.99 or 0.90 the resulting VCF files (after ApplyRecalibration) both have many "good" indels with "PASS" in the FILTER column?

    Thanks a lot for your help!

    Mike

    Attachment: GATK_UG_SelIndel.tranches.pdf (8K)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin
    edited November 2012

    Hi Mike,

    In fact the tranches plot is not relevant for indels; in the next release we are going to disable the generation of this pdf in indel mode since it is confusing. The proper thing to look at for indels is the Gaussian mixture model plots, so if yours look ok you're probably fine.


    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks for the info!

    Just in case, since I did see differences in the plots compared to those from the SNP VQSR, I have attached the pdf plot here; could you take a look at it and see if it is OK?

    Thanks

    Mike

    Attachment: GATK_UG_SelIndel.R.pdf (17M)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Sorry Mike, I can't comment on individual results unless there's some evidence of a bug; otherwise everyone will want to have us smell-check their results, and we just don't have the time for that...

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Thanks, and that's fine. The only reason I asked is that this is my first time using VQSR on indels, and I'm not sure how the plots are supposed to look. It would be nice if there were some comments or suggestions on the indel plots from VQSR modeling somewhere in the documentation. Thanks again, Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Mike,

    We've put adding an indel example to the VQSR doc on our TODO list. In the meantime, all I can say is that indel plots aren't expected to be very different from SNP plots, generally speaking.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Sounds great! Thank you, Geraldine! Mike

  • ddzankoffddzankoff Posts: 8Member

    Hi Geraldine,

    I am reading the above posts and I can see that you are busy, but I cannot at present solve the following problem with my VQSR: the tranches plot seems incorrect (the bar lengths are incorrect, I think) although the tranches file may be OK. Also, I wonder whether the false positive rate values for each tranche are printed somewhere? The data are from human exome sequencing.

    This is the tranches file:

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,15999,75,3.1949,2.5714,0.7305,VQSRTrancheSNP0.00to90.00,SNP,15635,14071,0.9000
    99.00,18643,334,3.0884,1.6094,-6.7607,VQSRTrancheSNP90.00to99.00,SNP,15635,15478,0.9900
    99.90,19279,377,3.0173,1.4013,-40.2908,VQSRTrancheSNP99.00to99.90,SNP,15635,15619,0.9990
    100.00,19430,400,2.9856,1.2857,-27457.4005,VQSRTrancheSNP99.90to100.00,SNP,15635,15635,1.0000

    Please see the attached pdf file of tranches plot.

    Thank you in advance for your answer

    Sincerely Dimitar Zankov dzankoff@belle.shiga-med.ac.jp

    Attachment: SNPs_VQSR.tranches.pdf (8K)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Dimitar

    Can you tell me which version you are using? If not the latest, can you please try again with the latest version and let me know if the problem persists?

    Geraldine Van der Auwera, PhD

  • ddzankoffddzankoff Posts: 8Member

    Hi Geraldine,

    Thank you very much for the answer! I think I am using the latest GATK v2.3-9-ge5ebf34. If I can find out the false positive rate (or calculate it somehow) for every tranche, I can build the plots myself, I guess. Could you tell me how to do that (if it is not too much trouble or too complicated)?

    Thank you

    DImitar

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Dimitar,

    After consulting the developer, it seems that the most probable reason your tranches plot looks weird is that you are using the default Ti/Tv ratio, which is the value estimated for whole genomes. The value of the expected Ti/Tv ratio is used to calculate the FP rate. If you are working with exomes, you need to change that value (using the --target_titv argument described in the Technical Documentation page). You'll need to decide which --target_titv ratio is most appropriate for your experimental design.
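    A minimal sketch of that adjustment on the VariantRecalibrator command line; 2.8 and 3.2 are the exome values used by posters in this thread, but the appropriate number depends on your capture design:

        --target_titv 3.2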

    Good luck!

    Geraldine Van der Auwera, PhD

  • ddzankoffddzankoff Posts: 8Member

    Hi Geraldine,

    Thank you very much for your very kind help and detailed explanation! I will try what you suggested. I appreciate very much the GATK and dedicated team behind it!

    Thank you again Dimitar

  • ddzankoffddzankoff Posts: 8Member

    Hi Geraldine,

    Just to confirm that your advice works. Please see the pdf file. Tranches look OK (I hope) at TiTv set to 2.8. Cannot say how grateful I am.

    Dimitar

    Attachment: VQSR.tranches.pdf (21K)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Thanks for confirming, Dimitar, I'm glad to hear it worked for you.

    Geraldine Van der Auwera, PhD

  • simonwilliamssimonwilliams Posts: 7Member

    Hi, I'm also having problems that can be seen in the tranches plots. I'm running UnifiedGenotyper on 30 exomes. My VariantRecalibrator command is:

    -input snps.raw.vcf -R /path/to/hg19.fa -T VariantRecalibrator -mG 6 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/hapmap.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 /path/to/omni.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /path/to/dbsnp.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode SNP -recalFile combined_snps.recal -tranchesFile combined_snps.tranches -rscriptFile combined_snps.plots.R

    The plots seem to be formatted incorrectly - probably due to my own errors rather than a bug - but I am getting far too many FPs. Could this be due to something obvious that I'm doing wrong? Thanks for any pointers, Simon

    Attachment: combined_snps.tranches.pdf (8K)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Simon,

    It might be the same problem as Dimitar had; see my explanation about the Ti/Tv ratio above. Try adjusting that and let me know if you're still having issues.

    Geraldine Van der Auwera, PhD

  • simonwilliamssimonwilliams Posts: 7Member

    Thanks for the quick response Geraldine,

    I added the parameter '--target_titv 2.8' as suggested by Dimitar but it doesn't seem to make any difference to the plot.

    The top plot seems to be formatted incorrectly as the order of truth tranches is reversed from the other plots displayed on this page i.e. 100, 99.9, 99, 90 (top to bottom) as opposed to 90, 99, 99.9, 100. Is this something I can control? And should I be concerned by the potentially low ti/tv ratios?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Simon, sorry for the delay.

    Could you post the log from the VariantRecalibrator run and the contents of the tranches files?

    Geraldine Van der Auwera, PhD

  • simonwilliamssimonwilliams Posts: 7Member

    Thanks for looking at this. Here is the contents of the tranches file:

    # Variant quality score tranches file
    # Version number 5
    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,3153120,508705,2.1208,0.7094,5.7019,VQSRTrancheSNP0.00to90.00,SNP,1771460,1594314,0.9000
    99.00,3776458,667766,2.1336,0.7385,1.7651,VQSRTrancheSNP90.00to99.00,SNP,1771460,1753745,0.9900
    99.90,4062619,763129,2.1068,0.7733,-4.6855,VQSRTrancheSNP99.00to99.90,SNP,1771460,1769688,0.9990
    100.00,4192268,848731,2.0847,0.8108,-7549.2384,VQSRTrancheSNP99.90to100.00,SNP,1771460,1771460,1.0000

    See attached for the variantRecalibrator log file.

    Simon

    Attachment: variantRecalibrator.log (17K)
  • rpoplinrpoplin Posts: 122GATK Developer mod

    Hi there,

    We've taken a look at the logs and don't see any red flags which might explain this behavior. I wonder, can you please tell us more about how your input bams and variant calls were generated? This seems like way too many SNP calls to come from only 30 human exomes.

    Thanks,

  • simonwilliamssimonwilliams Posts: 7Member

    I think I might have an idea now. I have been splitting the original bams into separate chromosomes using samtools and running UG over each. I haven't been specifying the chromosome with -L (I didn't think I'd need to), but looking at the logs for these, the progress meter shows that it is walking over them all. I'm guessing that this is why I have so many SNP calls and that this is what is messing with my tranches. I'm running it again in the correct way and will let you know the outcome. Thanks

  • hmk123hmk123 Posts: 11Member

    I may be overlooking something, but is there a way to get a "clean" VCF file out after VQSR? By this I mean a VCF file where variants that are filtered by this are removed (not just marked as problematic)? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    No, that's not possible. The point of VQSR is that you can fine-tune recalibration by making several passes over the same file. If it were to actively remove variants it would break the fine-tuning functionality. To produce a "clean" VCF as you describe, just do an additional step with SelectVariants.
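    A minimal sketch of that extra step, assuming GATK's SelectVariants tool with the --excludeFiltered option (argument names may vary between versions, and the file names are placeholders):

        java -Xmx2g -jar GenomeAnalysisTK.jar \
           -T SelectVariants \
           -R reference.fasta \
           --variant recalibrated_variants.vcf \
           --excludeFiltered \
           -o clean_variants.vcf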

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    This might be a silly question, but I was wondering whether having some PASS variants with a negative VQSLOD score was normal. In particular, I see this for indels that are called. They aren't very large negative numbers but still this would suggest a greater probability that the variant being called is false vs. true, so I'm just wondering why these would not be filtered ? Are there any special cases?

    Thanks,

    MC

  • chongmchongm Posts: 33Member

    Also I used a truth sensitivity level of 95% as recommended for indels.

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi there, sorry it took me a while to get back to you on this.

    Yes, it's ok to have PASSing variants with a negative VQSLOD. The threshold for filtering is based on the truth sensitivity and not the VQSLOD, so if you asked for a larger truth sensitivity then that would dip down lower and lower on the VQSLOD scale.

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    Hi Geraldine,

    I see, so do you recommend adjusting the truth sensitivity until there are no negative VQSLOD scores?

    Thanks,

    MC

  • chongmchongm Posts: 33Member

    Would this be the equivalent of just filtering out negative VQSLOD scores?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    No, you just shouldn't worry about negative VQSLODs. Truth sensitivity is a better criterion.

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    For the tranches plot file, how are the number of true positive and false positive variants determined for novel variants?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Those are estimated based on the deviation from the specified expected (target) Ti/Tv ratio.
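    As a rough sketch of that estimate (assuming, as is conventional, that false positive SNPs have a Ti/Tv of about 0.5; the exact constants GATK uses may differ):

        estimated FP rate ~ (target Ti/Tv - observed novel Ti/Tv) / (target Ti/Tv - 0.5)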

    Geraldine Van der Auwera, PhD

  • hmk123hmk123 Posts: 11Member

    I have a question about small sample sizes. We have run the VariantRecalibrator for a small targeted resequencing project. I have added this to the command line: --maxGaussians 4 --percentBad 0.05, but I still get a warning. I have tried to add .bam files from 1000 Genomes (in the UnifiedGenotyper step), but I still get warnings. We have another 200 subjects that we have sequencing from, but our original data was from SOLiD and the new data is Illumina. Can I put two different VCFs in to recalibrate? I don't think I would be able to put BAM files from SOLiD and Illumina into the UnifiedGenotyper together, but please correct me if I am wrong.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi there,

    Mixing technologies is generally speaking a bad idea, but if you have 200 samples with each then it should be fine because the program will have enough data to learn clusters specific to those samples in addition to learning the clusters that are shared between the technologies.

    As for how to proceed, it doesn't really matter whether you input the two VCFs directly into the VR or whether you give all 400 bams to the UG and make one callset. Just make sure that the callsets are generated using the same calling parameters, so that the call stats are comparable.

    Geraldine Van der Auwera, PhD

  • WimSWimS Posts: 25Member

    I have 10 strains of the same non-human species and I want to use VQSR on the raw single sample SNP calls. A subset of the 10 strains have been genotyped using a SNP array. How can I use that SNP array as a truth dataset? The snp array data is in some kind of a csv format. Can I just convert that format to the BED format and extract those positions from the single sample VCF, to a truth VCF, and supply that as a truth dataset to VQSR?

    Can I use the same dataset as a training dataset, or does this need to be a different (bigger and/or non-overlapping?) dataset than the truth dataset? Or can I just, for example, take all the high-quality calls (quality above 100) from the single-sample raw SNP calls and supply this as a training set?

    I also have reference calls in my raw single-sample VCFs. Do I need to manually exclude them completely from the VQSR process?

  • rpoplinrpoplin Posts: 122GATK Developer mod

    Hi there,

    I'm glad you are having fun experimenting with the VQSR. You are correct that the first step in using your array data as a training set with the VQSR is to convert it into a VCF file.

    You can use the same input callset as the training set if you want. The VQSR does a clustering procedure and so you are essentially evaluating the variant calls by how well they cluster with the rest of the variants in your callset. The higher quality your training set is the more valuable this procedure will be.

    The VQSR will treat each record in your input VCF as a variant to cluster and evaluate, so the answer to this question really depends on your goals. If you have meaningful variant annotations for your reference sites, and there are meaningful reference sites in the training set, then it should just work; otherwise, the best thing would probably be to remove them from the callset, because they aren't variants that should be clustered with the rest.

    I hope that is helpful. Cheers,

  • mikemike Posts: 103Member

    Hi,

    I have a question about VQSR on indel variants. Indels depend largely on the alignment, which determines exactly how they are annotated: their location (the POS column of the VCF) and the change itself (the REF and ALT columns). So what if the indels in the GATK training data (e.g. Mills_and_1000G_gold_standard.indels.hg19.vcf) have slightly different annotations (location, REF, ALT) compared to the indels in my input VCF file (e.g. due to alignment or annotation differences, especially in short repeat sequences)? Does VQSR "smartly" recognize that the indels from the training data and the input file are the same indels despite slight differences in how they are annotated? Or does this not impact VQSR much, as long as the two sets have enough overlapping indels? I am just curious whether GATK has done something to deal with these cases. When we use different tools to call indels, we (and others) keep seeing indels that appear to be different but in fact refer to the same event; they are represented in the VCFs of the two calling methods in slightly different ways, which could make us think they are different indels.

    Thanks and best

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Unfortunately, if the indel is annotated differently as you describe (eg the location is off by one) then the GATK won't recognize that it is the same as a known indel. Doing realignment around indels with a file of known indels should largely compensate for this, however.

    Geraldine Van der Auwera, PhD
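    A minimal sketch of the realignment mentioned above, assuming the GATK 2.x RealignerTargetCreator / IndelRealigner pair and the Mills gold standard indels from the resource bundle (file names are placeholders; check the tool documentation for your version):

        java -Xmx4g -jar GenomeAnalysisTK.jar \
           -T RealignerTargetCreator \
           -R reference.fasta \
           -I input.bam \
           -known Mills_and_1000G_gold_standard.indels.vcf \
           -o realigner.intervals

        java -Xmx4g -jar GenomeAnalysisTK.jar \
           -T IndelRealigner \
           -R reference.fasta \
           -I input.bam \
           -known Mills_and_1000G_gold_standard.indels.vcf \
           -targetIntervals realigner.intervals \
           -o realigned.bam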

  • mikemike Posts: 103Member

    Thx a lot for the info, which is what I like to know.

  • mikemike Posts: 103Member

    Hi,

    For VQSR, I found that for the -mode option, besides the SNP and INDEL choices, there is a third choice: BOTH. I am a bit confused here, since almost all of the command examples show VQSR commands that differ for indels and SNPs, especially in the training resource files, and even the Best Practices mention building the VQSR model differently for indels and SNPs. I usually separate them out and run VQSR separately on indels and SNPs. Are there any reasons or circumstances in which we need to run VQSR with BOTH for the -mode option? What are the pros and cons? Since BOTH exists for the -mode option, I just wonder what it is for and when to use it. If I use the BOTH option for -mode, I need to supply training files for both indels and SNPs at the same time, and GATK will automatically build the VQSR models separately for indels and SNPs, right?

    Thanks a lot!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Mike,

    We no longer recommend using the -BOTH option; I'll check with the team whether it would be good to disable it entirely to avoid any confusion.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks for the clarification. Just to make sure: one of my colleagues has indeed used that option (-BOTH) in VQSR to generate the final variant callsets. Do you think that is still OK, or should we redo it by first separating the indels and SNPs out into separate VCF files and then running VQSR on them separately?

    What is your suggestion?

    Thanks again

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi Mike,

    I would indeed recommend redoing the recalibration, but there is no need to separate out the SNPs and indels. You can just recalibrate them successively in the same file, because when you specify e.g. SNP mode, the indels are emitted without modification, and vice versa. See the VQSR tutorial for a detailed walk-through of the process.

    Geraldine Van der Auwera, PhD
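    A minimal sketch of that successive recalibration, assuming the resources and arguments quoted elsewhere in this thread (SNPs first, then indels run on the SNP-recalibrated file; variants of the other type pass through each step unmodified; file names and thresholds are placeholders):

        # Pass 1: SNPs
        java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta -input raw.vcf -mode SNP \
           -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
           -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
           -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
           -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
           -recalFile snp.recal -tranchesFile snp.tranches
        java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R reference.fasta -input raw.vcf -mode SNP \
           -recalFile snp.recal -tranchesFile snp.tranches --ts_filter_level 99.9 -o recal_snps.vcf

        # Pass 2: indels, run on the output of pass 1
        java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta -input recal_snps.vcf -mode INDEL \
           -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.vcf \
           -an QD -an MQRankSum -an ReadPosRankSum -an FS --maxGaussians 4 \
           -recalFile indel.recal -tranchesFile indel.tranches
        java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R reference.fasta -input recal_snps.vcf -mode INDEL \
           -recalFile indel.recal -tranchesFile indel.tranches --ts_filter_level 95.0 -o recal_snps_indels.vcf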

  • bd5fh2bd5fh2 Posts: 5Member

    Hi, I hope this is the right place to post this question about vqsr FILTER field. After going through the gatk forum I couldn't find a satisfactory answer to my question for a clear interpretation of the definition of . (dot) value in the FILTER field assigned by VQSR. This description is the closest I can get: "FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis."

    However, it doesn't explain exactly in what situation(s) no filtering is applied to a SNP, and why. Should I rely on "PASS" only, or should I consider those with "." in a separate bucket?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    The GATK callers (UnifiedGenotyper and HaplotypeCaller) emit all variant records with a dot in the FILTER field by default, except those that are under the call confidence threshold (but still above the emit confidence threshold), which have LowQual in the FILTER field instead of a dot. After that, any filtering you perform on the variants will typically change the variant FILTER status to either PASS or a fail value. The only exception is when you run VQSR for example in SNP mode, which leaves any indels untouched, and vice versa.

    So typically if you encounter variants with a dot in the FILTER field, it means that either it is a raw callset just fresh from calling, and no additional filtering has been applied yet, or VQSR has only been run on the other type of variant.

    Does that help?

    Geraldine Van der Auwera, PhD

  • morgantaschukmorgantaschuk Posts: 16Member

    Could you tell me when VariantRecalibrator would use more memory? I have two tests with whole genomes, one with a single library, and the other with multiple lanes with the same library (technical replicates). Both sets are dedupped, realigned, recalibrated, and variants called with UnifiedGenotyper. The files are comparable in size - the single lane of variants is 814M, and the merged lanes of variants is 954M. Here is the command line I'm using:

    java -Xmx18000M -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_random.fa -mode SNP \
    -tranche 99.0 -percentBad 0.01 -minNumBad 1000 -input merged_raw.vcf -recalFile snp.recal -tranchesFile snp.tranches  \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf \
    -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -nt 8

    Despite the fact that I'm only giving 18G to the JVM, the amount of memory used on the machine for the single lane of variants hits a maximum of 38G. The merged lanes of variants only use ~19G. They both complete, thank goodness, but I'm very curious as to why the single lane needs so much more memory than the merged set. Lower confidence perhaps?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hmm, I don't think I've seen this before. Is this something you observe consistently or is it a one-off observation?

    Geraldine Van der Auwera, PhD

  • morgantaschukmorgantaschuk Posts: 16Member

    I'm running analysis now to determine whether it's consistent. Unfortunately, it'll be a few days before I have an answer. Do you have any idea why it would balloon like that?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Not really, no. If it's a one-off thing it could be just a server hiccup...

    Geraldine Van der Auwera, PhD

  • williamwwilliamw North CarolinaPosts: 2Member

    I am trying to call variants at a number of specific sites across the genome (splice sites) over about 100 samples. I've run HaplotypeCaller on this data set and produced a set of about 3,117 SNPs at the TS=99.0 level. The tranche plot for the SNP recalibration looks pretty similar to the one above (although the Ti/Tv ratio drops to about 0.9 at the 99.9 level, which greatly increases the number of false positives there). The tranche plot after I filter the indels is very strange looking, however... the bars actually go towards a negative number of novel variants? Not quite sure how to interpret this. Some help would be appreciated!

    Attachment: allsamples_allsplicesites_SNP_INDEL.tranches.pdf (7K)
  • JiahaoJiahao Posts: 2Member

    Hi,

    I have run the GATK realignment-recalibration BAM processing, UnifiedGenotyper and HaplotypeCaller for the tumor-normal paired data from exome capture. Now I have the raw variant VCF results, and am wondering whether I should employ VariantRecalibrator, which is recommended for large data sets (with a large number of samples?). If yes, should I merge the VCF files from tumor and normal in advance, or process them separately?

    Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi @williamw,

    The problem here is that the tranches plot is not applicable to indels. In the next version we will disable generation of the plots to avoid this confusion.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi @Jiahao,

    You can only run VariantRecalibrator jointly on samples that were called (run through UG or HC) jointly in the first place. So if you want to recalibrate T-N pairs together, you should go back and call them jointly, if you didn't already do that.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member
    edited November 2013

    Question about VQSR plots: I don't have much experience in determining bad plots, but based on my understanding of this page, my attached example looks pretty bad - what do you think? I am trying to figure out how to improve these. My dataset is whole exomes (~ 1,100 samples), called recently with version 2.7-2 (UG) with your latest recommendations.

    Would any of the following contribute to this effect :

    • I ran UG -stand_call_conf and -stand_emit_conf set to 10 (default is 30)
    • I used ts_filter_level 99.9 (in your best practices). I am dealing with a homogeneous population, according to the paragraph below, would it be better to use 99.0 ? "Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope."
    • I ran VQSR using numBad 10000, 20000, 30000, 40000, 50000 and 100000 (out of ~ 1million SNPs). The different values here don't seem to make any drastic improvements.

    Would appreciate any thoughts/suggestions, as I'm a bit lost at this point. I have just watched your videos on this again, haven't made much progress.

    Attachments: vsvinti_vqsr_plot.jpg (1400 x 1400, 120K), vsvinti_tranche_plot.png (893 x 531, 81K)
  • vsvintivsvinti Posts: 47Member

    As an aside, is HaplotypeScore worth looking at? The new Best Practices do not seem to list it in the recommended options for VQSR.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi @vsvinti,

    That does look pretty bad. You mention you're working with exomes; did you use a list of capture targets for the pre-processing and calling steps?

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member

    Yes, I have used the -L option with intervals at each step of the process (including VQSR). I have split those into several files though, and ran UG on each, then concatenated the outputs (the remaining steps were run on concat + sorted vcf). I have just repeated this with ts_filter_level 99.0 and the plots have gotten worse.

    I have done the whole thing previously on ~ half of these samples with gatk 2.5-2, and the plots looked ok then. Any ideas of checks I could do to find the problem would be much appreciated. My data looked pretty good up to this point!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Can you try running the VQSR step using GATK 2.5-2? It would be good to know if the results are significantly different, in case it's an issue with the recent changes to VQSR. If they're not then maybe it's an issue with recent changes to the calling step.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member

    Ok. I'll try to find the exact options I used last time for 2.5-2. Will report back soon.

  • vsvintivsvinti Posts: 47Member
    edited November 2013

    Alright, here it is. Repeated the VQSR step with gatk 2.5-2, using the recommended options at the time I ran this last.

    Either:

    • The update in the recommended options creates such a stark difference in results, or
    • The differences are due to the VQSR version

    They are looking pretty good to me - what do you think? (I don't know about the tranche, have to read a bit more on that). Also, how come it looks like there are so few points in this graph, compared to above (or are they overlaid)?

    Attachments: vqsr_v2.5.png (601 x 600, 160K), tranches_v2.5.png (891 x 563, 79K)
  • vsvintivsvinti Posts: 47Member
    edited November 2013

    Update: I have re-done this a second time with 2.5, but using the recent recommended options. It's looking good! Two more questions (sorry) :

    • Since I am back to using percentBad and minNumBad, I can't find whether the recommended options were -percentBad 0.01 -minNumBad 1000 (and whether I need to increase/experiment with minNumBad similarly to the new numBad?)
    • You say that the order of the -an options is important. I can presumably get the standard deviations for these from the recal files (?), and use this to order the options and rerun. Do they need to be sorted in increasing or decreasing order of standard deviation?

    Many thanks!

    As an aside, let me know if you are finding the same issues with 2.7 and when a patch might be available.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hmm, that does look like an issue with the released changes. Now, the main VQSR developer has been working hard on improvements to fix a number of issues that have popped up. You might try the latest nightly build and see if the dev version works better on your data. Sorry for the inconvenience, this is a complex piece of software and it takes a lot of experimenting to get it right.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member

    That's understandable. I will try the nightly build and let you know how I get on. In the meantime, I look forward to your thoughts on the post above about optimal options for VQSR 2.5.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Oh right; yes you can try increasing percentBad and minNumBad. Regarding the ordering of the -an options, I'm not sure, but I'll ask.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member

    Ok, please let me know when you get that info - hopefully soon enough. Appreciate it!

  • vsvintivsvinti Posts: 47Member
    edited November 2013

    Sorry to report that the VQSR in the latest patch (GenomeAnalysisTK-nightly-2013-11-14-g27609c3) still performs badly on my data. It seems like I need to finish this data off with VQSR 2.5. I look forward to hearing on the above order of -an options for 2.5. Thank you.

    Attachment: vqsr_2.7Nightly14Nov13.png (605 x 599, 238K)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    OK, I'll let you know when the new improvements/fix becomes available. Regarding the annotations, the dev tells me that ordering the annotations doesn't actually make the results better, but it makes them more stable across multiple runs. And the idea is to order them in order of increasing std deviation.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member

    That's great - thank you for clarifying!

  • ddzankoffddzankoff Posts: 8Member

    Hi Geraldine,

    I have a question about the plots created after VQSR. In my case, I sometimes observe that although the true variants cluster very well, they are overlapped by the variants from the negative training set. Also, in the area of well-clustered true variants I can see that some of them are filtered. Is this acceptable and part of the process, or does the VQSR still need optimisation? Please see the attached plots. I can also see the same in the plots above in this post. I am using GATK 2.7-2.

    Thank you in advance

    Dimitar

    Attachment: Plots.pdf (4M)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi @ddzankoff,

    Your plots look really good! Very nice clustering. And yes it is normal to have some overlap of bad variants in the good variants area -- it just means that they were considered bad in other dimensions than those shown in the plots. Based on this I would say your callset is good to go.

    Geraldine Van der Auwera, PhD

  • ddzankoffddzankoff Posts: 8Member

    Hi Geraldine,

    Thank you for your fast reply. It is a big relief to know that. I just used the default parameters and followed the guides from this website. You are doing a really helpful job!

    Thank you again

    Dimitar

  • vsvintivsvinti Posts: 47Member

    Hi there

    Thanks for the VQSR updates in gatk 2.8. I have run this again with v2.5, 2.7 and 2.8 (this time I have ~ 1,000 more samples). V2.7 still looks equally bad, while 2.5 and 2.8 show similar vqsr plots.

    However, for some plots I see differences, such as in the attached, where MQRankSum in 2.8 draws a lot of red points (filtered) to the left of 0, while this is not so evident in 2.5. Which of these describes better behaviour?

    For the tranches plot, I get TiTv ratios of 1.7 (v2.5) and 1.94 (v2.8). I am dealing with exomes - is this not too low? I have used the --target_titv 3.2 argument for VariantRecalibrator.

    Would appreciate your thoughts once again. Thanks!

    Order of diagrams: plots vqsr 2.5, plots vqsr 2.8, tranches v2.5, tranches v2.8

    Attachments: vqsr2.5.png (601 x 596, 172K), vqsr2.8.png (602 x 600, 149K), tranch_2.5.png (910 x 571, 81K), tranch_2.8.png (911 x 567, 79K)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi @vsvinti,

    I'd have to ask @rpoplin for his expert opinion, but my impression is that 2.8 gives you a tighter clustering so I would tend to favor that version.

    Regarding the TiTv it's hard to say as it can depend on your target list.

    By the way, when you say one set is run with 2.5 and the other with 2.8, do you mean that the variants were called and recalibrated with the 2.5 version then again with the 2.8 version, or did you just redo VQSR on an existing set of calls? We don't recommend mixing and matching tools from different GATK versions in the same workflow, so for best results you should redo the calling each time.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member

    Hi Geraldine

    Yes, I had processed the whole workflow with version 2.7-2 (as per my previous post). But because VQSR in that version does not work for me, I did the last step with 2.5 initially, and now also with 2.8. Recalling will take some time, but I will try it if I get a chance.

    I'll wait for @rpoplin to share his thoughts on the plots. Thanks!

  • vsvintivsvinti Posts: 47Member
    edited February 3

    Reminder @rpoplin : any thoughts on the plots above?

  • rpoplinrpoplin Posts: 122GATK Developer mod

    Hi there,

    The model that was learned for ReadPosRankSum versus MQRankSum looks like it was basically doing the right thing in both cases. That is, it looks like it learned that good variants cluster in a Gaussian centered at zero, with probability falling off as you move away from zero.
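
    To illustrate the kind of model being described (a minimal sketch only; GATK actually fits a mixture of several Gaussians to good and bad training variants, and the mean and covariance below are made up):

    import numpy as np

    mean = np.zeros(2)                     # rank-sum annotations centered on 0 for good variants
    cov = np.array([[1.0, 0.2],
                    [0.2, 1.5]])           # hypothetical covariance "learned" from training data
    cov_inv = np.linalg.inv(cov)
    log_norm = -0.5 * (2 * np.log(2 * np.pi) + np.log(np.linalg.det(cov)))

    def log_density(x):
        """Log-density of the Gaussian at a (ReadPosRankSum, MQRankSum) vector."""
        d = x - mean
        return log_norm - 0.5 * d @ cov_inv @ d

    print(log_density(np.array([0.0, 0.0])))    # near the cluster center: high
    print(log_density(np.array([0.5, -4.0])))   # far from zero on MQRankSum: much lower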

    I wonder, do the other dimensions look reasonable?

    Can you post the log file from the 2.8 run? There is a lot of diagnostic information in there that will help us debug what might be going on with your exomes.

    Cheers,

  • mprasadmprasad FrancePosts: 8Member

    Hi Geraldine,

    I have a small concern about whether VQSR is working well for me with my parameters. I have a single whole-exome sample for which I am trying to call variants. Following the Best Practices recommendation, I called variants using HaplotypeCaller for my sample along with 30 other 1000G samples, and got around 159,000 variants. I am now trying to use VQSR on the generated VCF file. Everything runs normally and I get the tranche plots and recal plots. I think my recal plots look decent (please feel free to correct me if I am wrong), but I am not sure about my tranche plots (please find plots attached). I am especially concerned by the fact that the known Ti/Tv is low, between 0.85 and 1.365 depending on the truth sensitivity threshold. I would like to use a truth sensitivity of 99.0 or 99.9 but am concerned by the extremely low Ti/Tv values that correspond to these tranches - am I going to filter out a lot of novel variants? Is it normal to obtain such low values, or am I missing something/doing something wrong? Here is the command line I am using to run VQSR.

    gatk -T VariantRecalibrator \
        -R '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta' \
        -input '/media/Seagate Backup Plus Drive/HDS_05/HC.raw.30s.vcf' \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/hapmap_3.3.b37.vcf' \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/1000G_omni2.5.b37.vcf' \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/1000G_phase1.snps.high_confidence.b37.vcf' \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/dbsnp_138.b37.vcf' \
        -an QD -an MQRankSum -an ReadPosRankSum -an FS \
        -mode SNP \
        -titv 3.2 \
        -recalFile '/media/Seagate Backup Plus Drive/HDS_05/snps.30s.recal' \
        -tranchesFile '/media/Seagate Backup Plus Drive/HDS_05/snps.30s.tranches' \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
        -rscriptFile '/media/Seagate Backup Plus Drive/HDS_05/recalibrate_SNP_plots.R'

    I used the -titv argument to set this at 3.2 as I am using exome sequencing samples. I'd really appreciate your thoughts. Thank you for your help, Megana

    Attachments: recalibrate_SNP_plots.R.pdf (7M), snps.30s.tranches.pdf (8K), snps.30s.tranches.txt (563B)
  • vsvintivsvinti Posts: 47Member
    edited February 11

    Hi @rpoplin

    Here are all the vqsr plots for 2.8.

    By log file, do you mean the recal file generated with the -recalFile argument?

    Thanks,

    Vicky

    @rpoplin said: .. I wonder, do the other dimensions look reasonable?

    Can you post the log file from the 2.8 run. There is a lot of diagnostic information in there that will help us debug what might be going on with your exomes.

    Attachment: test5.SNP.plots.r.pdf (5M)
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    I believe @rpoplin means the console output, which can optionally be saved to a log file using the --log_to_file option.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member

    Log file is too big to attach here - where should I put it for you, @rpoplin ?

  • vsvintivsvinti Posts: 47Member

    As an aside, I have been comparing the distributions of the various annotations for the PASS variants (as filtered by VQSR) against the recommended hard-filter thresholds.

    They are mostly OK, but I get a spike of PASS variants with QD < 2. When I compare the Ti/Tv ratios of PASS variants with QD < 2 against those with QD >= 2, there is a big difference. I ended up excluding those with QD < 2, but wondered why VQSR keeps these variants. What are your thoughts?

    TiTvVariantEvaluator  CompRod  EvalRod  JexlExpression  Novelty  nTi     nTv     tiTvRatio  nTiInComp  nTvInComp  TiTvRatioStandard  nTiDerived  nTvDerived  tiTvDerivedRatio
    TiTvVariantEvaluator  dbsnp    QDless2  none            all       14698   24352       0.60   30214029   15253848               1.98           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDless2  none            known      2478    1547       1.60       3069       1330               2.31           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDless2  none            novel     12220   22805       0.54   30210960   15252518               1.98           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDmore2  none            all      520893  202375       2.57   30213889   15253801               1.98           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDmore2  none            known    265420   88415       3.00     263818      88290               2.99           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDmore2  none            novel    255473  113960       2.24   29950071   15165511               1.97           0           0              0.00
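
    A minimal sketch of this kind of stratified Ti/Tv check (not GATK's VariantEval; it assumes a plain-text VCF named "recalibrated.vcf" with one ALT allele per line, and counts PASS biallelic SNPs only):

    # [transitions, transversions] per QD stratum
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    counts = {"QDless2": [0, 0], "QDmore2": [0, 0]}

    with open("recalibrated.vcf") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            ref, alt, flt, info = fields[3], fields[4], fields[6], fields[7]
            if flt != "PASS" or len(ref) != 1 or len(alt) != 1:
                continue                            # keep PASS biallelic SNPs only
            qd = None
            for kv in info.split(";"):
                if kv.startswith("QD="):
                    qd = float(kv.split("=", 1)[1])
            if qd is None:
                continue
            bucket = "QDless2" if qd < 2 else "QDmore2"
            counts[bucket][0 if (ref, alt) in TRANSITIONS else 1] += 1

    for bucket, (ti, tv) in counts.items():
        print(bucket, ti, tv, round(ti / tv, 2) if tv else "NA")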
    
  • ChrisPattersonChrisPatterson Epilepsy Genomics CenterPosts: 7Member

    Hello, I am brand new to using GATK. We've recently collected WES on affected members and married-in controls of several large families of different ethnic origins. We have 6-8 samples per family. I read in the VQSR documentation that it works best when the sample size is greater than 30, so I was planning to download some of the BAM files from 1000 Genomes as you suggested. What considerations should I make during sample selection to get the best VQSR results, such as matching samples that use the same machine, aligner, target regions, ethnic background, etc.?

    Thanks, Chris

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    Hi @vsvinti,

    You can upload files to our FTP server (see FAQ here for details).

    I'm not sure when Ryan (@rpoplin) will have the time to answer you, because he is scheduled to travel in a few days and has a lot of high-priority work to finish before then.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,224Administrator, GATK Developer admin

    @ChrisPatterson You're on the right track -- the same target regions, definitely, to maximize overlap; the same machine / sequencing technology and aligner help ensure the error modes are similar; the same ethnic background is preferred. The closer you can get to matching your samples, the better.

    Geraldine Van der Auwera, PhD

  • mprasadmprasad FrancePosts: 8Member

    Hi Geraldine,

    I realize that you are too busy to look at everyone's VQSR plots, so I'll abridge my earlier question: pertaining to my prior post, I just wanted to make sure that a tranche plot where the truth sensitivity thresholds of 90, 99, 99.9 and 100 have corresponding Ti/Tv values much lower than expected for exome sequencing (between 0.85 and 1.36) isn't a cause for concern, especially if the recal plots look good. If this is unusual, I'd appreciate some suggestions on where to begin tweaking the variant calling/recalibration steps. This is my first time using VQSR, so I'd like to make sure all is well before proceeding.

    Thank you for your help, megana

  • vsvintivsvinti Posts: 47Member

    @rpoplin , I have uploaded vsvinti_test5VQSR.logfile.log to the FTP server. I will wait a couple of weeks for a response. Thanks.
