It looks like you're new here. If you want to get involved, click one of these buttons!
rpoplin
Posts: 92GSA Official Member mod
Slides that explain the VQSR methodology as well as the individual component variant annotations can be found here in the GSA Public Drop Box.
Detailed information about command line options for VariantRecalibrator can be found here.
Detailed information about command line options for ApplyRecalibration can be found here.
The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. One can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call.
The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model. The variant recalibrator contrastively evaluates variants in a two step process:
By way of explaining how one uses the variant quality score recalibrator and evaluating its performance we have put together this tutorial which uses example sequencing data produced at the Broad Institute. All of the data used in this tutorial is available in VCF format from our GATK resource bundle.
The default prior for all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the UnifiedGenotyper is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
Detailed information about command line options for VariantRecalibrator can be found here.
Build a Gaussian mixture model using a high quality subset of the input variants and evaluate those model parameters over the full call set. The following notes describe the appropriate inputs to use for this tool.
The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.
Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.
The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).
An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!
The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The first tranche is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibator walker, is shown on the right.
Tranches plot for example HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:
We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.
Detailed information about command line options for ApplyRecalibration can be found here.
Using the tranche file generated by the previous step the ApplyRecalibration walker looks at each variant's VQSLOD value and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level have their filter field annotated with its tranche level. This will result in a call set that simultaneously is filtered to the desired level but also has the information necessary to pull out more variants at a slightly lower quality level.
The five annotation values provided in the command lines above (QD, HaplotypeScore, MQRankSum, ReadPosRankSum, and HRun) have been show to give good results for a variety of data types. However this shouldn't be taken to mean these annotations give the absolute best modeling for every source of sequencing data. Better results could possibly be achieved through experimentation with which SNP annotations are used in the algorithm. The goal is to find annotation values with are approximately Gaussianly distributed and also serve to separate the probably true (known) SNPs from the probably false (novel) SNPs.
The -tranche arguments main purpose is to create the tranche plot (as shown above). They are meant to convey the idea that with real, calibrated variant quality scores one can create call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way an end user can choose to use some of the filtered records or only use the PASSing records. For new users to the variant quality score recalibrator perhaps the easiest thing to do in the beginning is simply select the single desired false discovery rate and pass that value in as a single -tranche argument to make sure that the desired rate can be achieved given the other parameters to the algorithm.
The VariantRecalibrator step accept lists of truth and training sites in several formats (dbsnp ROD, VCF, and BED, for example). Any list can be used but it is best to use only those sets which are of the best quality. The truth sets are passed into the algorithm using any rod binding name and their truth or training status is specified with rod tags (see VariantRecalibrator section above). We routinely use the HapMap v3.3 VCF file and the Omni2.5M SNP chip array in training the model. In general the false positive rate of dbsnp sites is too high to be used reliably for training the model.
HapMap v3.3 as well as the Omni validation array VCF files are available in our GATK resource bundle.
Absolutely! The VQSR accepts any list of sites to use as training / truth data, not just HapMap.
Don't have any truth data for your organism? No problem. There are several things one might experiment with. One idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP caller, of which the GATK is one, and use those sites which are concordant between the different methods as truth data. There are many fruitful avenues of research here. Hopefully the model reporting plots help facilitate this experimentation. Perhaps the best place to begin is to use a line like the following when specifying the truth set:
-resource:concordantSet,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf
This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.
One piece of advice is to turn down the number of Gaussians used during training and to turn up the number of variants that are used to train the negative model. This can be accomplished by adding --maxGaussians 4 --percentBad 0.05 to your command line.
percentBadVariants is the proportion of variants that the program will use to build the negative model. If you have a small callset, you need to use a larger proportion in order to have enough "bad" variants to build the negative model. maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.
The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well.
Comments
Hi Geraldine, I am working with yeast and I am doing the VariantRecalibrator step, as I dont have a truth data set I want to "filter" my initial round of raw SNP in order to have the highest quality score SNP as you say. I was wondering if you have any suggestion about the parameters of filtration...
I am working with each strain as different organism, so I have good coverage (80X) but only one Lane
I tried with:
to remove after the LowQual and hardtovalidate snps, that make sense? thanks for your help!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
The exact values can depend a lot on your data set. You'll need to experiment a little and see what works best for you. Two ways to check if you are getting good variants is to look them up in a genome viewer like IGV, and to call variants with a different program to see what's called in both.
Good luck!
PS: you have a typo in your command line, I think you mean
MQRankSumrather thanMQEankSum.Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •In addition, you should have a look at the hard-filtering parameters given for this purpose in the Best Practices document.
http://www.broadinstitute.org/gatk/guide/article?id=1186
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I tried the command indicated in the FAQ "Does the VQSR work with non-human variant calls?"
--B:concordantSet,VCF,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf
But the argument doesn't seems to be recognized (at least with gatk-2-2.5). If this is a typo or depreciated options, could it corrected ? Thanks.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •That specific command syntax is deprecated. The values are still applicable, you just need to pass them using the new syntax. Examples of the new syntax is given here:
http://www.broadinstitute.org/gatk/guide/article?id=1259
We are periodically reviewing and updating the entire documentation; this will be corrected when we come to it.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Just to clarify will I have problems with VQSR if the number of input samples (4 WES) rather than sites is small?
Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •It really depends on the total number of sites you have. You may be okay if your samples have a lot of variation. The best way to know is to try and see if the recalibrator complains...
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi:
Just a quick question. For VariantRecalibrator, the command is described as below on the documentation:
I used old version of GATK, for the last option -resource:dbsnp,, the prior used to be 8.0, now it is prior=6.0 for the new version, is this intentional or just a typo? or due to the change of the dbsnp version as 135 instead of 132?
Thanks a lot
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I'll check this with @rpoplin, who knows this stuff better than anyone, but I'm pretty sure it's intentional and linked to the dbsnp version.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, Geraldine:
Thank you very much! Look forward to that!!
I also have a related question also about VQSR, For the new version, the above command is mainly for SNP's VQSR step, I did see some description on Indel VQSR, but not much details or example commands like the ones I pasted above..so my question is: for Indel's VQSR, what training data I shall use and more importantly, what prior values I shall use for each of them. My indel callset has about 4000 of them, looks like a workable number for VQSR step based on the VQSR documentation, right?
Thanks again!
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •OK, I checked and that is indeed the case -- the "8" prior is what was used for the tutorial given on the VQSR page, which uses dbsnp version 132, while the updated Best Practices use dbsnp version 135 with prior=6. FYI, the hierarchy of documents is that the Best Practices is always the most up-to-date, so when in doubt, use whatever is given there. Next are FAQs, which we do update as needed. Unfortunately we don't have the time to update all tutorials and examples (such as you find in the VQSR article), so those may not use the latest values and command lines. That is why it is important to check the Best Practices.
For indel resource recommendations, see the before-last paragraph in this doc:
http://www.broadinstitute.org/gatk/guide/article?id=1247
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, Dear Geraldine:
Thanks so much for the info and detailed answers, Appreciated very much!
Just want to confirm my last question on indel using VQSR: based on that doc from your link above
-resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf
is the only resources I needed for the VQSR step? It is equivalent as the hapmap resources for SNPs' VQSR. the reason I am asking is for SNPs, there are two additional resources as below: -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
with known and training and truth setting at different levels using Omni and dbSNP, whereas for indel we only need one setting as known=false,training=true,truth=true using the gold.standard.indel, is my understanding correct?
Thanks again!
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Mike,
That's correct, for indels the Mills set is the only resource for which we have a good evaluation of quality, and so it's the only one we use for VQSR.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks again, Geraldine!
I just finished the indel VQSR and the tranches plot is kind of odd although Gaussian mixture model plots seem not so bad. So I have to come back to ask your opinion on this.
Here is the actual command I used:
I also attached the pdf file and image for the tranches plot as below
Here is my tranches file:
My question would be: why the tranches plot looks like no TP at all, however, at 0,99 or 0.90, the result vcf files (after ApplyRecalibration) both have many "Good" indels with "PASS" at the column "FILTER"?
Thanks a lot for your help!
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Mike,
In fact the tranches plot is not relevant for indels; in the next release we are going to disable the generation of this pdf in indel mode since it is confusing. The proper thing to look at for indels is the Gaussian mixture model plots, so if yours look ok you're probably fine.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, Geraldine:
Thanks for the info!
Just in case, since I did see difference of the plots compared to those from SNPs' VQSR, here I attached the pdf plot, could you take a look at it and see if it is OK?
Thanks
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Sorry Mike, I can't comment on individual results unless there's some evidence of a bug; otherwise everyone will want to have us smell-check their results, and we just don't have the time for that...
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks and that's fine. The only reason I asked is that this is my first time use VQSR on indels, not sure how the plots supposed to look like,. It would be nice if there are some comments or suggestions on the indel plots from VQSR modeling somewhere in the documentation. Thanks again, Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Mike,
We've put adding an indel example to the VQSR doc on our TODO list. In the meantime, all I can say is that indel plots aren't expected to be very different from SNP plots, generally speaking.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Sounds great! Thank you, Geraldine! Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
I am reading the above posts and I can see that you are busy but I cannot solve at present the following problem in my VQSR: the tranches plot seems incorrect (bar length is incorrect I think) although the tranches file may be OK. Also I wonder if somewhere the false positive rate values for every tranche are printed? The data are from human exome sequencing.
This is the tranches file:
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,15999,75,3.1949,2.5714,0.7305,VQSRTrancheSNP0.00to90.00,SNP,15635,14071,0.9000 99.00,18643,334,3.0884,1.6094,-6.7607,VQSRTrancheSNP90.00to99.00,SNP,15635,15478,0.9900 99.90,19279,377,3.0173,1.4013,-40.2908,VQSRTrancheSNP99.00to99.90,SNP,15635,15619,0.9990 100.00,19430,400,2.9856,1.2857,-27457.4005,VQSRTrancheSNP99.90to100.00,SNP,15635,15635,1.0000
Please see the attached pdf file of tranches plot.
Thank you in advance for your answer
Sincerely Dimitar Zankov dzankoff@belle.shiga-med.ac.jp
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Dimitar
Can you tell me which version you are using? If not the latest, can you please try again with the latest version and let me know if the problem persists?
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
Thank you very much for the answer! I think I am using the latest GATK v2.3-9-ge5ebf34. If I can find out the false positive rate (or calculate somehow) for every tranche I can build the plots by myself, I guess. Could you tell me how to do that (if it is not big trouble or too complicated)?
Thank you
DImitar
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Dimitar,
After consulting the developer, it seems that the most probable reason your tranches plot looks weird is because you are using the default Ti/Tv ratio, which is the value estimated for whole genomes. The value of the expected Ti/TV ratio is used to calculate the FP rate. If you are working with exomes, you need to change that value (using the
--target_titvargument described in the Technical Documentation page). You'll need to decide which --target_titv ratio is most appropriate for your experimental design.Good luck!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
Thank you very much for your very kind help and detailed explanation! I will try what you suggested. I appreciate very much the GATK and dedicated team behind it!
Thank you again Dimitar
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
Just to confirm that your advice works. Please see the pdf file. Tranches look OK (I hope) at TiTv set to 2.8. Cannot say how grateful I am.
Dimitar
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for confirming, Dimitar, I'm glad to hear it worked for you.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I'm also having problems that can be seen in the tranches plots. I'm running UnifiedGenotyper on 30 exomes. My VariantRecalibrator command is:
-input snps.raw.vcf -R /path/to/hg19.fa -T VariantRecalibrator -mG 6 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/hapmap.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 /path/to/omni.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /path/to/dbsnp.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode SNP -recalFile combined_snps.recal -tranchesFile combined_snps.tranches -rscriptFile combined_snps.plots.R
The plots seem to be formatted incorrectly - probably due to my own errors rather than a bug - but I am getting far too many FPs. Could this be due to something obvious that I'm doing wrong? Thanks for any pointers, Simon
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Simon,
It might be the same problem as Dimitar had; see my explanation about the Ti/Tv ratio above. Try adjusting that and let me know if you're still having issues.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for the quick response Geraldine,
I added the parameter '--target_titv 2.8' as suggested by Dimitar but it doesn't seem to make any difference to the plot.
The top plot seems to be formatted incorrectly as the order of truth tranches is reversed from the other plots displayed on this page i.e. 100, 99.9, 99, 90 (top to bottom) as opposed to 90, 99, 99.9, 100. Is this something I can control? And should I be concerned by the potentially low ti/tv ratios?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Simon, sorry for the delay.
Could you post the log from the VariantRecalibrator run and the contents of the tranches files?
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for looking at this. Here is the contents of the tranches file:
Variant quality score tranches file
Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,3153120,508705,2.1208,0.7094,5.7019,VQSRTrancheSNP0.00to90.00,SNP,1771460,1594314,0.9000 99.00,3776458,667766,2.1336,0.7385,1.7651,VQSRTrancheSNP90.00to99.00,SNP,1771460,1753745,0.9900 99.90,4062619,763129,2.1068,0.7733,-4.6855,VQSRTrancheSNP99.00to99.90,SNP,1771460,1769688,0.9990 100.00,4192268,848731,2.0847,0.8108,-7549.2384,VQSRTrancheSNP99.90to100.00,SNP,1771460,1771460,1.0000
See attached for the variantRecalibrator log file.
Simon
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
We've taken a look at the logs and don't see any red flags which might explain this behavior. I wonder, can you please tell us more about how your input bams and variant calls were generated? This seems like way too many SNP calls to come from only 30 human exomes.
Thanks,
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I think I might have an idea now. I have been splitting the original bams into separate chromosomes using samtools and running UG over each. I haven't been specifying the chromosome with -L (I didn't think I'd need to) but looking at the logs for these, the progress meter shows that it is walking over them all. I'm guessing that this why I have so many snp calls and that this is what is messing with my tranches. I'm running it again in the correct way and will let you know the outcome. Thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I may be overlooking something, but is there a way to get a "clean" VCF file out after VQSR? By this I mean a VCF file where variants that are filtered by this are removed (not just marked as problematic)? Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •No, that's not possible. The point of VQSR is that you can fine-tune recalibration by making several passes over the same file. If it were to actively remove variants it would break the fine-tuning functionality. To produce a "clean" VCF as you describe, just do an additional step with SelectVariants.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •This might be a silly question, but I was wondering whether having some PASS variants with a negative VQSLOD score was normal. In particular, I see this for indels that are called. They aren't very large negative numbers but still this would suggest a greater probability that the variant being called is false vs. true, so I'm just wondering why these would not be filtered ? Are there any special cases?
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Also I used a truth sensitivity level of 95% as recommended for indels.
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there, sorry it took me a while to get back to you on this.
Yes, it's ok to have PASSing variants with a negative VQSLOD. The threshold for filtering is based on the truth sensitivity and not the VQSLOD, so if you asked for a larger truth sensitivity then that would dip down lower and lower on the VQSLOD scale.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
I see, so do you recommend adjusting the truth sensitivity until there are no negative VQSLOD scores?
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Would this be the equivalent of just filtering out negative VQSLOD scores?
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •No, you just shouldn't worry about negative VQSLODs. Truth sensitivity is a better criterium.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •For the tranches plot file, how are the number of true positive and false positive variants determined for novel variants?
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Those are estimated based on the deviation from the specified known ti/tv ratio.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I have a question about small sample sizes. We have run the VariantRecalibrator for a small targeted resequencing project. I have added this to the command line: --maxGaussians 4 --percentBad 0.05, but I still get a warning. I have tried to add .bam files from 1000s genomes (in the Unified Genotyper step), but I still get warnings. We have another 200 subjects that we have sequencing from. But our original data was from SOLiD but the new data is Illumnia. Can I put 2 different vcfs in to recalibrate? Because I don't think I would be able to put bam files from SOLiD and Illumnia into the Unified Genotyper? But please correct me if I am wrong.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
Mixing technologies is generally speaking a bad idea, but if you have 200 samples with each then it should be fine because the program will have enough data to learn clusters specific to those samples in addition to learning the clusters that are shared between the technologies.
As for how to proceed, it doesn't really matter whether you input the two VCFs directly into the VR or whether you give all 400 bams to the UG and make one callset. Just make sure that the callsets are generated using the same calling parameters, so that the call stats are comparable.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •