Service note: Geraldine is on vacation this week; other members of GSA will be responding to questions, but they have a lot of work besides this, so be aware that responses may be a little slower than usual. Thank you for your patience.

Variant Quality Score Recalibration (VQSR)

rpoplinrpoplin Posts: 91GSA Official Member mod

Slides that explain the VQSR methodology as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.

Introduction

The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. One can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model. The variant recalibrator contrastively evaluates variants in a two step process:

  • VariantRecalibration - Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants.
  • ApplyRecalibration - Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the lod threshold as provided by the user (with the ts_filter_level parameter).

Recalibration tutorial with example HiSeq, single sample, deep coverage, whole genome call set

By way of explaining how one uses the variant quality score recalibrator and evaluating its performance we have put together this tutorial which uses example sequencing data produced at the Broad Institute. All of the data used in this tutorial is available in VCF format from our GATK resource bundle.

Input call set

input: NA12878.HiSeq.WGS.bwa.cleaned.raw.b37.subset.vcf

  • These calls were generated with the UnifiedGenotyper from a 30X coverage modern, single sample run of HiSeq. They were randomly downsampled to keep the file size small but in general one would want to use the full set of variants available genome-wide for this procedure. No other pre-filtering steps were applied to the raw output.

Training sets

HapMap 3.3: hapmap_3.3.b37.sites.vcf

  • These high quality sites are used both to train the Gaussian mixture model and then again when choosing a LOD threshold based on sensitivity to truth sites.
  • The parameters for these sites will be: known = false, training = true, truth = true, prior = Q15 (96.84%)

Omni 2.5M chip: 1000G_omni2.5.b37.sites.vcf

  • These polymorphic sites from the Omni genotyping array are used when training the model.
  • The parameters for these sites will be: known = false, training = true, truth = false, prior = Q12 (93.69%)

dbSNP build 132: dbsnp_132.b37.vcf

  • The dbsnp sites are generally considered to be not of high enough quality to be used in training but here we stratify output metrics such as ti/tv ratio by presence in dbsnp (known sites) or not (novel sites).
  • The parameters for these sites will be: known = true, training = false, truth = false, prior = Q8 (84.15%)

The default prior for all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the UnifiedGenotyper is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

VariantRecalibrator

Detailed information about command line options for VariantRecalibrator can be found here.

Build a Gaussian mixture model using a high quality subset of the input variants and evaluate those model parameters over the full call set. The following notes describe the appropriate inputs to use for this tool.

  • Note that this walker expects call sets in which each record has been appropriately annotated (see e.g. VariantAnnotator). Input call set rod bindings must start with "input". See the command line below.
  • When constructing an initial call set (see e.g. Unified Genotyper or Haplotype Caller) for use with the Recalibrator, it's generally best to turn down the confidence threshold to allow more borderline calls (trusting the Recalibrator to keep the real ones while filtering out the false positives). For example, we often use a Q20 threshold on our deep coverage calls with the Recalibrator (whereas the default threshold in the UnifiedGenotyper is Q30).
  • No pre-filtering is necessary when using the Recalibrator. See below for the advanced options which allow the user to selectively ignore certain filters if they have already been applied to your call set.
  • The tool accepts any ROD bindings when specifying the set of truth sites to be used during modeling. Information about how to download VCF files which we routinely use for training is in the FAQ section at the bottom of the page.
  • Each training set ROD binding is specified with key-value tags to qualify whether the set should be considered as known sites, training sites, and/or truth sites. Additionally, the prior probability of being true for those sites is also specified via these tags in Phred scale. See the command line below for an example. An explanation for how each of the training sets is used by the algorithm:
  • Training sites: Input variants which are found to overlap with these training sites are used to build the Gaussian mixture model.
  • Truth sites: When deciding where to set the cutoff in VQSLOD sensitivity to these truth sites is used. Typically one might want to say I dropped my threshold until I got back 99% of HapMap sites, for example.
  • Known sites: The known / novel status of a variant isn't used by the algorithm itself and is only used for reporting / display purposes. The output metrics are stratified by known status in order to aid in comparisons with other call sets.

Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

imageGaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The first tranche is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibator walker, is shown on the right.

imageTranches plot for example HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric [YES!]
  • The truth sensitivity (TS) approach is conceptual cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

ApplyRecalibration

Detailed information about command line options for ApplyRecalibration can be found here.

Using the tranche file generated by the previous step the ApplyRecalibration walker looks at each variant's VQSLOD value and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level have their filter field annotated with its tranche level. This will result in a call set that simultaneously is filtered to the desired level but also has the information necessary to pull out more variants at a slightly lower quality level.

Frequently Asked Questions

How do I know which annotations to use for my data?

The five annotation values provided in the command lines above (QD, HaplotypeScore, MQRankSum, ReadPosRankSum, and HRun) have been show to give good results for a variety of data types. However this shouldn't be taken to mean these annotations give the absolute best modeling for every source of sequencing data. Better results could possibly be achieved through experimentation with which SNP annotations are used in the algorithm. The goal is to find annotation values with are approximately Gaussianly distributed and also serve to separate the probably true (known) SNPs from the probably false (novel) SNPs.

How do I know which -tranche arguments to pass into the VariantRecalibrator step?

The -tranche arguments main purpose is to create the tranche plot (as shown above). They are meant to convey the idea that with real, calibrated variant quality scores one can create call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way an end user can choose to use some of the filtered records or only use the PASSing records. For new users to the variant quality score recalibrator perhaps the easiest thing to do in the beginning is simply select the single desired false discovery rate and pass that value in as a single -tranche argument to make sure that the desired rate can be achieved given the other parameters to the algorithm.

What should I use as training data?

The VariantRecalibrator step accept lists of truth and training sites in several formats (dbsnp ROD, VCF, and BED, for example). Any list can be used but it is best to use only those sets which are of the best quality. The truth sets are passed into the algorithm using any rod binding name and their truth or training status is specified with rod tags (see VariantRecalibrator section above). We routinely use the HapMap v3.3 VCF file and the Omni2.5M SNP chip array in training the model. In general the false positive rate of dbsnp sites is too high to be used reliably for training the model.

HapMap v3.3 as well as the Omni validation array VCF files are available in our GATK resource bundle.

Does the VQSR work with non-human variant calls?

Absolutely! The VQSR accepts any list of sites to use as training / truth data, not just HapMap.

Don't have any truth data for your organism? No problem. There are several things one might experiment with. One idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP caller, of which the GATK is one, and use those sites which are concordant between the different methods as truth data. There are many fruitful avenues of research here. Hopefully the model reporting plots help facilitate this experimentation. Perhaps the best place to begin is to use a line like the following when specifying the truth set:

-resource:concordantSet,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf

Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties. One piece of advice is to turn down the number of Gaussians used during training and to turn up the number of variants that are used to train the negative model. This can be accomplished by adding --maxGaussians 4 --percentBad 0.05 to your command line.

percentBadVariants is the proportion of variants that the program will use to build the negative model. If you have a small callset, you need to use a larger proportion in order to have enough "bad" variants to build the negative model. maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well.

Post edited by Geraldine_VdAuwera on

Comments

  • lbernalberna Posts: 5Member

    Hi Geraldine, I am working with yeast and I am doing the VariantRecalibrator step, as I dont have a truth data set I want to "filter" my initial round of raw SNP in order to have the highest quality score SNP as you say. I was wondering if you have any suggestion about the parameters of filtration...

    I am working with each strain as different organism, so I have good coverage (80X) but only one Lane

    I tried with:

    java -Xmx4g -jar GenomeAnalysisTK.jar  -R S288c.fasta -T VariantFiltration  --variant $1.raw.vcf  --filterExpression "QD<2.0 || MQ<45.0 || FS>60 || MQEankSum< -12.5 || ReadPosRankSum<-8.0 "  --filterName "hardtovalidate"   -o $1.filt.vcf
    

    to remove after the LowQual and hardtovalidate snps, that make sense? thanks for your help!

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi there,

    The exact values can depend a lot on your data set. You'll need to experiment a little and see what works best for you. Two ways to check if you are getting good variants is to look them up in a genome viewer like IGV, and to call variants with a different program to see what's called in both.

    Good luck!

    PS: you have a typo in your command line, I think you mean MQRankSum rather than MQEankSum.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    In addition, you should have a look at the hard-filtering parameters given for this purpose in the Best Practices document.

    http://www.broadinstitute.org/gatk/guide/article?id=1186

    Geraldine Van der Auwera, PhD

  • FrancoisGuillaumeFrancoisGuillaume Posts: 3Member

    Hi, I tried the command indicated in the FAQ "Does the VQSR work with non-human variant calls?"

    --B:concordantSet,VCF,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf

    But the argument doesn't seems to be recognized (at least with gatk-2-2.5). If this is a typo or depreciated options, could it corrected ? Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    That specific command syntax is deprecated. The values are still applicable, you just need to pass them using the new syntax. Examples of the new syntax is given here:

    http://www.broadinstitute.org/gatk/guide/article?id=1259

    We are periodically reviewing and updating the entire documentation; this will be corrected when we come to it.

    Geraldine Van der Auwera, PhD

  • TechnicalVaultTechnicalVault Posts: 29Member

    Just to clarify will I have problems with VQSR if the number of input samples (4 WES) rather than sites is small?

    Post edited by TechnicalVault on

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    It really depends on the total number of sites you have. You may be okay if your samples have a lot of variation. The best way to know is to try and see if the recalibrator complains...

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 65Member

    Hi:

    Just a quick question. For VariantRecalibrator, the command is described as below on the documentation:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
       -T VariantRecalibrator \
       -R reference/human_g1k_v37.fasta \
       -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
    ......
    

    I used old version of GATK, for the last option -resource:dbsnp,, the prior used to be 8.0, now it is prior=6.0 for the new version, is this intentional or just a typo? or due to the change of the dbsnp version as 135 instead of 132?

    Thanks a lot

    Mike

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    I'll check this with @rpoplin, who knows this stuff better than anyone, but I'm pretty sure it's intentional and linked to the dbsnp version.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 65Member

    Hi, Geraldine:

    Thank you very much! Look forward to that!!

    I also have a related question also about VQSR, For the new version, the above command is mainly for SNP's VQSR step, I did see some description on Indel VQSR, but not much details or example commands like the ones I pasted above..so my question is: for Indel's VQSR, what training data I shall use and more importantly, what prior values I shall use for each of them. My indel callset has about 4000 of them, looks like a workable number for VQSR step based on the VQSR documentation, right?

    Thanks again!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    OK, I checked and that is indeed the case -- the "8" prior is what was used for the tutorial given on the VQSR page, which uses dbsnp version 132, while the updated Best Practices use dbsnp version 135 with prior=6. FYI, the hierarchy of documents is that the Best Practices is always the most up-to-date, so when in doubt, use whatever is given there. Next are FAQs, which we do update as needed. Unfortunately we don't have the time to update all tutorials and examples (such as you find in the VQSR article), so those may not use the latest values and command lines. That is why it is important to check the Best Practices.

    For indel resource recommendations, see the before-last paragraph in this doc:

    http://www.broadinstitute.org/gatk/guide/article?id=1247

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 65Member

    Hi, Dear Geraldine:

    Thanks so much for the info and detailed answers, Appreciated very much!

    Just want to confirm my last question on indel using VQSR: based on that doc from your link above

    -resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf

    is the only resources I needed for the VQSR step? It is equivalent as the hapmap resources for SNPs' VQSR. the reason I am asking is for SNPs, there are two additional resources as below: -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \

    with known and training and truth setting at different levels using Omni and dbSNP, whereas for indel we only need one setting as known=false,training=true,truth=true using the gold.standard.indel, is my understanding correct?

    Thanks again!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Mike,

    That's correct, for indels the Mills set is the only resource for which we have a good evaluation of quality, and so it's the only one we use for VQSR.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 65Member

    Thanks again, Geraldine!

    I just finished the indel VQSR and the tranches plot is kind of odd although Gaussian mixture model plots seem not so bad. So I have to come back to ask your opinion on this.

    Here is the actual command I used:

    java -Xmx4g -jar /Path/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar  
    -T VariantRecalibrator 
    -R /Path/hg19.fa 
    -input /Path/GATK_UG_SelIndel.vcf 
    -resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 /Path/bundle-1.5/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf 
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff  
    --maxGaussians 4 
    -recalFile /Path/GATK_UG_SelIndel.VarRecal 
    -tranchesFile /Path/GATK_UG_SelIndel.tranches 
    -rscriptFile /Path/GATK_UG_SelIndel.R 
    -mode INDEL
    

    I also attached the pdf file and image for the tranches plot as below

    image

    Here is my tranches file:

    # Variant quality score tranches file
    # Version number 5
    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,0,2997,0.0000,0.0000,0.3232,VQSRTrancheINDEL0.00to90.00,INDEL,2724,2451,0.8998
    99.00,0,4033,0.0000,0.0000,-0.7785,VQSRTrancheINDEL90.00to99.00,INDEL,2724,2696,0.9897
    99.90,0,4517,0.0000,0.0000,-10.5409,VQSRTrancheINDEL99.00to99.90,INDEL,2724,2721,0.9989
    100.00,0,4539,0.0000,0.0000,-52.4947,VQSRTrancheINDEL99.90to100.00,INDEL,2724,2724,1.0000
    

    My question would be: why the tranches plot looks like no TP at all, however, at 0,99 or 0.90, the result vcf files (after ApplyRecalibration) both have many "Good" indels with "PASS" at the column "FILTER"?

    Thanks a lot for your help!

    Mike

    pdf
    pdf
    GATK_UG_SelIndel.tranches.pdf
    8K
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Mike,

    In fact the tranches plot is not relevant for indels; in the next release we are going to disable the generation of this pdf in indel mode since it is confusing. The proper thing to look at for indels is the Gaussian mixture model plots, so if yours look ok you're probably fine.

    Post edited by rpoplin on

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 65Member

    Hi, Geraldine:

    Thanks for the info!

    Just in case, since I did see difference of the plots compared to those from SNPs' VQSR, here I attached the pdf plot, could you take a look at it and see if it is OK?

    Thanks

    Mike

    pdf
    pdf
    GATK_UG_SelIndel.R.pdf
    17M
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Sorry Mike, I can't comment on individual results unless there's some evidence of a bug; otherwise everyone will want to have us smell-check their results, and we just don't have the time for that...

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 65Member

    Thanks and that's fine. The only reason I asked is that this is my first time use VQSR on indels, not sure how the plots supposed to look like,. It would be nice if there are some comments or suggestions on the indel plots from VQSR modeling somewhere in the documentation. Thanks again, Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Mike,

    We've put adding an indel example to the VQSR doc on our TODO list. In the meantime, all I can say is that indel plots aren't expected to be very different from SNP plots, generally speaking.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 65Member

    Sounds great! Thank you, Geraldine! Mike

  • ddzankoffddzankoff Posts: 6Member

    Hi Geraldine,

    I am reading the above posts and I can see that you are busy but I cannot solve at present the following problem in my VQSR: the tranches plot seems incorrect (bar length is incorrect I think) although the tranches file may be OK. Also I wonder if somewhere the false positive rate values for every tranche are printed? The data are from human exome sequencing.

    This is the tranches file:

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,15999,75,3.1949,2.5714,0.7305,VQSRTrancheSNP0.00to90.00,SNP,15635,14071,0.9000 99.00,18643,334,3.0884,1.6094,-6.7607,VQSRTrancheSNP90.00to99.00,SNP,15635,15478,0.9900 99.90,19279,377,3.0173,1.4013,-40.2908,VQSRTrancheSNP99.00to99.90,SNP,15635,15619,0.9990 100.00,19430,400,2.9856,1.2857,-27457.4005,VQSRTrancheSNP99.90to100.00,SNP,15635,15635,1.0000

    Please see the attached pdf file of tranches plot.

    Thank you in advance for your answer

    Sincerely Dimitar Zankov dzankoff@belle.shiga-med.ac.jp

    pdf
    pdf
    SNPs_VQSR.tranches.pdf
    8K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Dimitar

    Can you tell me which version you are using? If not the latest, can you please try again with the latest version and let me know if the problem persists?

    Geraldine Van der Auwera, PhD

  • ddzankoffddzankoff Posts: 6Member

    Hi Geraldine,

    Thank you very much for the answer! I think I am using the latest GATK v2.3-9-ge5ebf34. If I can find out the false positive rate (or calculate somehow) for every tranche I can build the plots by myself, I guess. Could you tell me how to do that (if it is not big trouble or too complicated)?

    Thank you

    DImitar

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Dimitar,

    After consulting the developer, it seems that the most probable reason your tranches plot looks weird is because you are using the default Ti/Tv ratio, which is the value estimated for whole genomes. The value of the expected Ti/TV ratio is used to calculate the FP rate. If you are working with exomes, you need to change that value (using the --target_titv argument described in the Technical Documentation page). You'll need to decide which --target_titv ratio is most appropriate for your experimental design.

    Good luck!

    Geraldine Van der Auwera, PhD

  • ddzankoffddzankoff Posts: 6Member

    Hi Geraldine,

    Thank you very much for your very kind help and detailed explanation! I will try what you suggested. I appreciate very much the GATK and dedicated team behind it!

    Thank you again Dimitar

  • ddzankoffddzankoff Posts: 6Member

    Hi Geraldine,

    Just to confirm that your advice works. Please see the pdf file. Tranches look OK (I hope) at TiTv set to 2.8. Cannot say how grateful I am.

    Dimitar

    pdf
    pdf
    VQSR.tranches.pdf
    21K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Thanks for confirming, Dimitar, I'm glad to hear it worked for you.

    Geraldine Van der Auwera, PhD

  • simonwilliamssimonwilliams Posts: 6Member

    Hi, I'm also having problems that can be seen in the tranches plots. I'm running UnifiedGenotyper on 30 exomes. My VariantRecalibrator command is:

    -input snps.raw.vcf -R /path/to/hg19.fa -T VariantRecalibrator -mG 6 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/hapmap.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 /path/to/omni.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /path/to/dbsnp.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode SNP -recalFile combined_snps.recal -tranchesFile combined_snps.tranches -rscriptFile combined_snps.plots.R

    The plots seem to be formatted incorrectly - probably due to my own errors rather than a bug - but I am getting far too many FPs. Could this be due to something obvious that I'm doing wrong? Thanks for any pointers, Simon

    pdf
    pdf
    combined_snps.tranches.pdf
    8K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Simon,

    It might be the same problem as Dimitar had; see my explanation about the Ti/Tv ratio above. Try adjusting that and let me know if you're still having issues.

    Geraldine Van der Auwera, PhD

  • simonwilliamssimonwilliams Posts: 6Member

    Thanks for the quick response Geraldine,

    I added the parameter '--target_titv 2.8' as suggested by Dimitar but it doesn't seem to make any difference to the plot.

    The top plot seems to be formatted incorrectly as the order of truth tranches is reversed from the other plots displayed on this page i.e. 100, 99.9, 99, 90 (top to bottom) as opposed to 90, 99, 99.9, 100. Is this something I can control? And should I be concerned by the potentially low ti/tv ratios?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi Simon, sorry for the delay.

    Could you post the log from the VariantRecalibrator run and the contents of the tranches files?

    Geraldine Van der Auwera, PhD

  • simonwilliamssimonwilliams Posts: 6Member

    Thanks for looking at this. Here is the contents of the tranches file:

    Variant quality score tranches file

    Version number 5

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,3153120,508705,2.1208,0.7094,5.7019,VQSRTrancheSNP0.00to90.00,SNP,1771460,1594314,0.9000 99.00,3776458,667766,2.1336,0.7385,1.7651,VQSRTrancheSNP90.00to99.00,SNP,1771460,1753745,0.9900 99.90,4062619,763129,2.1068,0.7733,-4.6855,VQSRTrancheSNP99.00to99.90,SNP,1771460,1769688,0.9990 100.00,4192268,848731,2.0847,0.8108,-7549.2384,VQSRTrancheSNP99.90to100.00,SNP,1771460,1771460,1.0000

    See attached for the variantRecalibrator log file.

    Simon

    log
    log
    variantRecalibrator.log
    17K
  • rpoplinrpoplin Posts: 91GSA Official Member mod

    Hi there,

    We've taken a look at the logs and don't see any red flags which might explain this behavior. I wonder, can you please tell us more about how your input bams and variant calls were generated? This seems like way too many SNP calls to come from only 30 human exomes.

    Thanks,

  • simonwilliamssimonwilliams Posts: 6Member

    I think I might have an idea now. I have been splitting the original bams into separate chromosomes using samtools and running UG over each. I haven't been specifying the chromosome with -L (I didn't think I'd need to) but looking at the logs for these, the progress meter shows that it is walking over them all. I'm guessing that this why I have so many snp calls and that this is what is messing with my tranches. I'm running it again in the correct way and will let you know the outcome. Thanks

  • hmk123hmk123 Posts: 9Member

    I may be overlooking something, but is there a way to get a "clean" VCF file out after VQSR? By this I mean a VCF file where variants that are filtered by this are removed (not just marked as problematic)? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    No, that's not possible. The point of VQSR is that you can fine-tune recalibration by making several passes over the same file. If it were to actively remove variants it would break the fine-tuning functionality. To produce a "clean" VCF as you describe, just do an additional step with SelectVariants.

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 23Member

    This might be a silly question, but I was wondering whether having some PASS variants with a negative VQSLOD score was normal. In particular, I see this for indels that are called. They aren't very large negative numbers but still this would suggest a greater probability that the variant being called is false vs. true, so I'm just wondering why these would not be filtered ? Are there any special cases?

    Thanks,

    MC

  • chongmchongm Posts: 23Member

    Also I used a truth sensitivity level of 95% as recommended for indels.

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi there, sorry it took me a while to get back to you on this.

    Yes, it's ok to have PASSing variants with a negative VQSLOD. The threshold for filtering is based on the truth sensitivity and not the VQSLOD, so if you asked for a larger truth sensitivity then that would dip down lower and lower on the VQSLOD scale.

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 23Member

    Hi Geraldine,

    I see, so do you recommend adjusting the truth sensitivity until there are no negative VQSLOD scores?

    Thanks,

    MC

  • chongmchongm Posts: 23Member

    Would this be the equivalent of just filtering out negative VQSLOD scores?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    No, you just shouldn't worry about negative VQSLODs. Truth sensitivity is a better criterium.

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 23Member

    For the tranches plot file, how are the number of true positive and false positive variants determined for novel variants?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Those are estimated based on the deviation from the specified known ti/tv ratio.

    Geraldine Van der Auwera, PhD

  • hmk123hmk123 Posts: 9Member

    I have a question about small sample sizes. We have run the VariantRecalibrator for a small targeted resequencing project. I have added this to the command line: --maxGaussians 4 --percentBad 0.05, but I still get a warning. I have tried to add .bam files from 1000s genomes (in the Unified Genotyper step), but I still get warnings. We have another 200 subjects that we have sequencing from. But our original data was from SOLiD but the new data is Illumnia. Can I put 2 different vcfs in to recalibrate? Because I don't think I would be able to put bam files from SOLiD and Illumnia into the Unified Genotyper? But please correct me if I am wrong.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,238Administrator, GSA Official Member admin

    Hi there,

    Mixing technologies is generally speaking a bad idea, but if you have 200 samples with each then it should be fine because the program will have enough data to learn clusters specific to those samples in addition to learning the clusters that are shared between the technologies.

    As for how to proceed, it doesn't really matter whether you input the two VCFs directly into the VR or whether you give all 400 bams to the UG and make one callset. Just make sure that the callsets are generated using the same calling parameters, so that the call stats are comparable.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.