Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Questions about VQSR

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
This discussion was created from comments split from: Variant Quality Score Recalibration (VQSR).

Comments

  • lbernalberna Member
    edited November 2012

    Hi Geraldine,
    I am working with yeast and I am doing the VariantRecalibrator step, as I dont have a truth data set I want to "filter" my initial round of raw SNP in order to have the highest quality score SNP as you say. I was wondering if you have any suggestion about the parameters of filtration...

    I am working with each strain as different organism, so I have good coverage (80X) but only one Lane

    I tried with:

    java -Xmx4g -jar GenomeAnalysisTK.jar  -R S288c.fasta -T VariantFiltration  --variant $1.raw.vcf  --filterExpression "QD<2.0 || MQ<45.0 || FS>60 || MQEankSum< -12.5 || ReadPosRankSum<-8.0 "  --filterName "hardtovalidate"   -o $1.filt.vcf
    

    to remove after the LowQual and hardtovalidate snps, that make sense?
    thanks for your help!

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    The exact values can depend a lot on your data set. You'll need to experiment a little and see what works best for you. Two ways to check if you are getting good variants is to look them up in a genome viewer like IGV, and to call variants with a different program to see what's called in both.

    Good luck!

    PS: you have a typo in your command line, I think you mean MQRankSum rather than MQEankSum.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    In addition, you should have a look at the hard-filtering parameters given for this purpose in the Best Practices document.

    http://www.broadinstitute.org/gatk/guide/article?id=1186

  • FrancoisGuillaumeFrancoisGuillaume Jouy en josasMember

    Hi,
    I tried the command indicated in the FAQ "Does the VQSR work with non-human variant calls?"

    --B:concordantSet,VCF,known=true,training=true,truth=true,prior=10.0 path/to/concordantSet.vcf

    But the argument doesn't seems to be recognized (at least with gatk-2-2.5). If this is a typo or depreciated options, could it corrected ?
    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That specific command syntax is deprecated. The values are still applicable, you just need to pass them using the new syntax. Examples of the new syntax is given here:

    http://www.broadinstitute.org/gatk/guide/article?id=1259

    We are periodically reviewing and updating the entire documentation; this will be corrected when we come to it.

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭
    edited November 2012

    Just to clarify will I have problems with VQSR if the number of input samples (4 WES) rather than sites is small?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It really depends on the total number of sites you have. You may be okay if your samples have a lot of variation. The best way to know is to try and see if the recalibrator complains...

  • mikemike Member
    edited November 2012

    Hi:

    Just a quick question. For VariantRecalibrator, the command is described as below on the documentation:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
       -T VariantRecalibrator \
       -R reference/human_g1k_v37.fasta \
       -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
    ......
    

    I used old version of GATK, for the last option -resource:dbsnp,, the prior used to be 8.0, now it is prior=6.0 for the new version, is this intentional or just a typo? or due to the change of the dbsnp version as 135 instead of 132?

    Thanks a lot

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I'll check this with @rpoplin, who knows this stuff better than anyone, but I'm pretty sure it's intentional and linked to the dbsnp version.

  • mikemike Member

    Hi, Geraldine:

    Thank you very much! Look forward to that!!

    I also have a related question also about VQSR, For the new version, the above command is mainly for SNP's VQSR step, I did see some description on Indel VQSR, but not much details or example commands like the ones I pasted above..so my question is: for Indel's VQSR, what training data I shall use and more importantly, what prior values I shall use for each of them. My indel callset has about 4000 of them, looks like a workable number for VQSR step based on the VQSR documentation, right?

    Thanks again!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, I checked and that is indeed the case -- the "8" prior is what was used for the tutorial given on the VQSR page, which uses dbsnp version 132, while the updated Best Practices use dbsnp version 135 with prior=6. FYI, the hierarchy of documents is that the Best Practices is always the most up-to-date, so when in doubt, use whatever is given there. Next are FAQs, which we do update as needed. Unfortunately we don't have the time to update all tutorials and examples (such as you find in the VQSR article), so those may not use the latest values and command lines. That is why it is important to check the Best Practices.

    For indel resource recommendations, see the before-last paragraph in this doc:

    http://www.broadinstitute.org/gatk/guide/article?id=1247

  • mikemike Member

    Hi, Dear Geraldine:

    Thanks so much for the info and detailed answers, Appreciated very much!

    Just want to confirm my last question on indel using VQSR: based on that doc from your link above

    -resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf

    is the only resources I needed for the VQSR step? It is equivalent as the hapmap resources for SNPs' VQSR. the reason I am asking is for SNPs, there are two additional resources as below:
    -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \

    with known and training and truth setting at different levels using Omni and dbSNP, whereas for indel we only need one setting as known=false,training=true,truth=true using the gold.standard.indel, is my understanding correct?

    Thanks again!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Mike,

    That's correct, for indels the Mills set is the only resource for which we have a good evaluation of quality, and so it's the only one we use for VQSR.

  • mikemike Member
    edited November 2012

    Thanks again, Geraldine!

    I just finished the indel VQSR and the tranches plot is kind of odd although Gaussian mixture model plots seem not so bad. So I have to come back to ask your opinion on this.

    Here is the actual command I used:

    java -Xmx4g -jar /Path/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar  
    -T VariantRecalibrator 
    -R /Path/hg19.fa 
    -input /Path/GATK_UG_SelIndel.vcf 
    -resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 /Path/bundle-1.5/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf 
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff  
    --maxGaussians 4 
    -recalFile /Path/GATK_UG_SelIndel.VarRecal 
    -tranchesFile /Path/GATK_UG_SelIndel.tranches 
    -rscriptFile /Path/GATK_UG_SelIndel.R 
    -mode INDEL
    

    I also attached the pdf file and image for the tranches plot as below

    image

    Here is my tranches file:

    # Variant quality score tranches file
    # Version number 5
    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,0,2997,0.0000,0.0000,0.3232,VQSRTrancheINDEL0.00to90.00,INDEL,2724,2451,0.8998
    99.00,0,4033,0.0000,0.0000,-0.7785,VQSRTrancheINDEL90.00to99.00,INDEL,2724,2696,0.9897
    99.90,0,4517,0.0000,0.0000,-10.5409,VQSRTrancheINDEL99.00to99.90,INDEL,2724,2721,0.9989
    100.00,0,4539,0.0000,0.0000,-52.4947,VQSRTrancheINDEL99.90to100.00,INDEL,2724,2724,1.0000
    

    My question would be:
    why the tranches plot looks like no TP at all, however, at 0,99 or 0.90, the result vcf files (after ApplyRecalibration) both have many "Good" indels with "PASS" at the column "FILTER"?

    Thanks a lot for your help!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    edited November 2012

    Hi Mike,

    In fact the tranches plot is not relevant for indels; in the next release we are going to disable the generation of this pdf in indel mode since it is confusing. The proper thing to look at for indels is the Gaussian mixture model plots, so if yours look ok you're probably fine.

    Post edited by rpoplin on
  • mikemike Member

    Hi, Geraldine:

    Thanks for the info!

    Just in case, since I did see difference of the plots compared to those from SNPs' VQSR, here I attached the pdf plot, could you take a look at it and see if it is OK?

    Thanks

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Sorry Mike, I can't comment on individual results unless there's some evidence of a bug; otherwise everyone will want to have us smell-check their results, and we just don't have the time for that...

  • mikemike Member

    Thanks and that's fine. The only reason I asked is that this is my first time use VQSR on indels, not sure how the plots supposed to look like,. It would be nice if there are some comments or suggestions on the indel plots from VQSR modeling somewhere in the documentation. Thanks again, Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Mike,

    We've put adding an indel example to the VQSR doc on our TODO list. In the meantime, all I can say is that indel plots aren't expected to be very different from SNP plots, generally speaking.

  • mikemike Member

    Sounds great! Thank you, Geraldine! Mike

  • Hi Geraldine,

    I am reading the above posts and I can see that you are busy but I cannot solve at present the following problem in my VQSR: the tranches plot seems incorrect (bar length is incorrect I think) although the tranches file may be OK. Also I wonder if somewhere the false positive rate values for every tranche are printed? The data are from human exome sequencing.

    This is the tranches file:

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,15999,75,3.1949,2.5714,0.7305,VQSRTrancheSNP0.00to90.00,SNP,15635,14071,0.9000
    99.00,18643,334,3.0884,1.6094,-6.7607,VQSRTrancheSNP90.00to99.00,SNP,15635,15478,0.9900
    99.90,19279,377,3.0173,1.4013,-40.2908,VQSRTrancheSNP99.00to99.90,SNP,15635,15619,0.9990
    100.00,19430,400,2.9856,1.2857,-27457.4005,VQSRTrancheSNP99.90to100.00,SNP,15635,15635,1.0000

    Please see the attached pdf file of tranches plot.

    Thank you in advance for your answer

    Sincerely
    Dimitar Zankov
    [email protected]

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Dimitar

    Can you tell me which version you are using? If not the latest, can you please try again with the latest version and let me know if the problem persists?

  • Hi Geraldine,

    Thank you very much for the answer! I think I am using the latest GATK v2.3-9-ge5ebf34. If I can find out the false positive rate (or calculate somehow) for every tranche I can build the plots by myself, I guess.
    Could you tell me how to do that (if it is not big trouble or too complicated)?

    Thank you

    DImitar

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Dimitar,

    After consulting the developer, it seems that the most probable reason your tranches plot looks weird is because you are using the default Ti/Tv ratio, which is the value estimated for whole genomes. The value of the expected Ti/TV ratio is used to calculate the FP rate. If you are working with exomes, you need to change that value (using the --target_titv argument described in the Technical Documentation page). You'll need to decide which --target_titv ratio is most appropriate for your experimental design.

    Good luck!

  • Hi Geraldine,

    Thank you very much for your very kind help and detailed explanation! I will try what you suggested.
    I appreciate very much the GATK and dedicated team behind it!

    Thank you again
    Dimitar

  • Hi Geraldine,

    Just to confirm that your advice works. Please see the pdf file. Tranches look OK (I hope) at TiTv set to 2.8.
    Cannot say how grateful I am.

    Dimitar

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for confirming, Dimitar, I'm glad to hear it worked for you.

  • Hi,
    I'm also having problems that can be seen in the tranches plots. I'm running UnifiedGenotyper on 30 exomes. My VariantRecalibrator command is:

    -input snps.raw.vcf
    -R /path/to/hg19.fa
    -T VariantRecalibrator
    -mG 6
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/hapmap.vcf
    -resource:omni,known=false,training=true,truth=false,prior=12.0 /path/to/omni.vcf
    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /path/to/dbsnp.vcf
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff
    -mode SNP
    -recalFile combined_snps.recal
    -tranchesFile combined_snps.tranches
    -rscriptFile combined_snps.plots.R

    The plots seem to be formatted incorrectly - probably due to my own errors rather than a bug - but I am getting far too many FPs. Could this be due to something obvious that I'm doing wrong?
    Thanks for any pointers,
    Simon

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Simon,

    It might be the same problem as Dimitar had; see my explanation about the Ti/Tv ratio above. Try adjusting that and let me know if you're still having issues.

  • Thanks for the quick response Geraldine,

    I added the parameter '--target_titv 2.8' as suggested by Dimitar but it doesn't seem to make any difference to the plot.

    The top plot seems to be formatted incorrectly as the order of truth tranches is reversed from the other plots displayed on this page i.e. 100, 99.9, 99, 90 (top to bottom) as opposed to 90, 99, 99.9, 100. Is this something I can control? And should I be concerned by the potentially low ti/tv ratios?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Simon, sorry for the delay.

    Could you post the log from the VariantRecalibrator run and the contents of the tranches files?

  • Thanks for looking at this. Here is the contents of the tranches file:

    Variant quality score tranches file

    Version number 5

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,3153120,508705,2.1208,0.7094,5.7019,VQSRTrancheSNP0.00to90.00,SNP,1771460,1594314,0.9000
    99.00,3776458,667766,2.1336,0.7385,1.7651,VQSRTrancheSNP90.00to99.00,SNP,1771460,1753745,0.9900
    99.90,4062619,763129,2.1068,0.7733,-4.6855,VQSRTrancheSNP99.00to99.90,SNP,1771460,1769688,0.9990
    100.00,4192268,848731,2.0847,0.8108,-7549.2384,VQSRTrancheSNP99.90to100.00,SNP,1771460,1771460,1.0000

    See attached for the variantRecalibrator log file.

    Simon

  • rpoplinrpoplin Member ✭✭✭

    Hi there,

    We've taken a look at the logs and don't see any red flags which might explain this behavior. I wonder, can you please tell us more about how your input bams and variant calls were generated? This seems like way too many SNP calls to come from only 30 human exomes.

    Thanks,

  • I think I might have an idea now. I have been splitting the original bams into separate chromosomes using samtools and running UG over each. I haven't been specifying the chromosome with -L (I didn't think I'd need to) but looking at the logs for these, the progress meter shows that it is walking over them all. I'm guessing that this why I have so many snp calls and that this is what is messing with my tranches. I'm running it again in the correct way and will let you know the outcome.
    Thanks

  • hmk123hmk123 Member

    I may be overlooking something, but is there a way to get a "clean" VCF file out after VQSR? By this I mean a VCF file where variants that are filtered by this are removed (not just marked as problematic)? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    No, that's not possible. The point of VQSR is that you can fine-tune recalibration by making several passes over the same file. If it were to actively remove variants it would break the fine-tuning functionality. To produce a "clean" VCF as you describe, just do an additional step with SelectVariants.

  • chongmchongm Member

    This might be a silly question, but I was wondering whether having some PASS variants with a negative VQSLOD score was normal. In particular, I see this for indels that are called. They aren't very large negative numbers but still this would suggest a greater probability that the variant being called is false vs. true, so I'm just wondering why these would not be filtered ? Are there any special cases?

  • chongmchongm Member

    Also I used a truth sensitivity level of 95% as recommended for indels.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there, sorry it took me a while to get back to you on this.

    Yes, it's ok to have PASSing variants with a negative VQSLOD. The threshold for filtering is based on the truth sensitivity and not the VQSLOD, so if you asked for a larger truth sensitivity then that would dip down lower and lower on the VQSLOD scale.

  • chongmchongm Member

    Hi Geraldine,

    I see, so do you recommend adjusting the truth sensitivity until there are no negative VQSLOD scores?

  • chongmchongm Member

    Would this be the equivalent of just filtering out negative VQSLOD scores?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    No, you just shouldn't worry about negative VQSLODs. Truth sensitivity is a better criterium.

  • chongmchongm Member

    For the tranches plot file, how are the number of true positive and false positive variants determined for novel variants?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Those are estimated based on the deviation from the specified known ti/tv ratio.

  • hmk123hmk123 Member

    I have a question about small sample sizes. We have run the VariantRecalibrator for a small targeted resequencing project. I have added this to the command line: --maxGaussians 4 --percentBad 0.05, but I still get a warning. I have tried to add .bam files from 1000s genomes (in the Unified Genotyper step), but I still get warnings. We have another 200 subjects that we have sequencing from. But our original data was from SOLiD but the new data is Illumnia. Can I put 2 different vcfs in to recalibrate? Because I don't think I would be able to put bam files from SOLiD and Illumnia into the Unified Genotyper? But please correct me if I am wrong.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    Mixing technologies is generally speaking a bad idea, but if you have 200 samples with each then it should be fine because the program will have enough data to learn clusters specific to those samples in addition to learning the clusters that are shared between the technologies.

    As for how to proceed, it doesn't really matter whether you input the two VCFs directly into the VR or whether you give all 400 bams to the UG and make one callset. Just make sure that the callsets are generated using the same calling parameters, so that the call stats are comparable.

  • WimSWimS Member ✭✭

    I have 10 strains of the same non-human species and I want to use VQSR on the raw single sample SNP calls. A subset of the 10 strains have been genotyped using a SNP array. How can I use that SNP array as a truth dataset? The snp array data is in some kind of a csv format. Can I just convert that format to the BED format and extract those positions from the single sample VCF, to a truth VCF, and supply that as a truth dataset to VQSR?

    Can I use the same dataset as a training dataset, or does this need to be a different ( bigger and / or non overlapping?) dataset than the truth dataset? Or can I just for example take all the high quality (quality above 100) from the single sample raw SNP calls and supply this a a training set?

    I also have reference call's in my raw single sample VCF's. Do I manually need to exclude them completely from the VQSR process?

  • rpoplinrpoplin Member ✭✭✭

    Hi there,

    I'm glad you are having fun experimenting with the VQSR. You are correct that the first step in using your array data as a training set with the VQSR is to convert it into a VCF file.

    You can use the same input callset as the training set if you want. The VQSR does a clustering procedure and so you are essentially evaluating the variant calls by how well they cluster with the rest of the variants in your callset. The higher quality your training set is the more valuable this procedure will be.

    The VQSR will treat each record in your input VCF as a variant to cluster and evaluate-- so the answer to this question really depends on your goals. If you have meaningful variant annotations for your reference sites and there are meaningful reference training sites then it should just work but otherwise the best thing would probably be to remove them from the callset because they aren't variants that should be clustered with the rest of the variants.

    I hope that is helpful. Cheers,

  • mikemike Member

    Hi,

    I have a question about VQSR on indels variants. since indels largely depends upon the alignment where how exactly the indels are annotated for their locations (POS column of vcf files) and changes (in REF and ALT columns of vcf files). So what if the indels in GATK training data (e.g., Mills_and_1000G_gold_standard.indels.hg19.vcf) have just slightly different annotations (e.g.location, REF, ALT) compared to indels in my input vcf file (e.g. due to alignment difference or annotation difference especially in some short repeat sequences etc), does VQSR "smartly" know the indels from training data and input file are the same indels despite of slight difference in annotation of the indel or the way of annotations? Or this won't impact VQSR much as long as the two have enough overlapping of same indels? I just curious if GATK has done something to deal with those indels cases. When we used different tools to call indels, we (also others) keep seeing many indels appear to be different ones but in fact refer to the same indels if we look more closely and they are presented in vcf file of two variant calling methods in slightly different ways that could make us think they are different indels.

    Thanks and best

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Unfortunately, if the indel is annotated differently as you describe (eg the location is off by one) then the GATK won't recognize that it is the same as a known indel. Doing realignment around indels with a file of known indels should largely compensate for this, however.

  • mikemike Member

    Thx a lot for the info, which is what I like to know.

  • mikemike Member

    Hi,

    For VQSR, I found for -mode option, besides SNP or INDEL choices, there is a third choice as BOTH. I am a bit confused here, since almost all of the command examples show commands of VQSR quite differently for Indel and SNPs especially for the training resource files part and even best practice mentioned building VQSR model differently for indels and SNPs. I usually separated them out and do VQSR separately on indels and SNPs. Is there any reasons or circumstances that we need do VQSR using BOTH for -mode option? What is pro and con for that? Since the BOTH exists for the --mode option, I just wonder what and when to use it? if I used BOTH option for -mode, I need supply training files for both indels and SNPs at the same time, and GATK will automatically build the VQSR models separately for indels and SNPs, right?

    Thanks a lot!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Mike,

    We no longer recommend using the -BOTH option; I'll check with the team whether it would be good to disable it entirely to avoid any confusion.

  • mikemike Member

    Hi, Geraldine:

    Thanks for the clarification. Just want to make sure, since one of my colleagues indeed had used that option )-BOTH) in VQSR to generate the final variants callsets, do you think it is still OK or we shall re-do it by first separating the indels and SNPs out into separate vcf files and then run VQSR on them separately?

    What is your suggestion?

    Thanks again

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Mike,

    I would indeed recommend redoing the recalibration, but no need to separate out the snps and indels. You can just recalibrate them successively in the same file because when you specify eg SNP mode, the indels are emitted without modification, and vice-versa. See the VQSR tutorial for a detailed walk through of the process.

  • mikemike Member

    thx a lot! Mike

  • bd5fh2bd5fh2 Member

    Hi, I hope this is the right place to post this question about vqsr FILTER field. After going through the gatk forum I couldn't find a satisfactory answer to my question for a clear interpretation of the definition of . (dot) value in the FILTER field assigned by VQSR. This description is the closest I can get:
    "FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis."

    However, it doesn't explain well exactly in what situation(s) no filtering is applied to an SNP, and why? Should I rely on "PASS" only, or should I consider those with "." in a separate bucket?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The GATK callers (UnifiedGenotyper and HaplotypeCaller) emit all variant records with a dot in the FILTER field by default, except those that are under the call confidence threshold (but still above the emit confidence threshold), which have FILTER instead of a dot. After that, any filtering you perform on the variants will typically change the variant FILTER status to either PASS or a fail value. The only exception is when you run VQSR for example in SNP mode, which leaves any indels untouched, and vice versa.

    So typically if you encounter variants with a dot in the FILTER field, it means that either it is a raw callset just fresh from calling, and no additional filtering has been applied yet, or VQSR has only been run on the other type of variant.

    Does that help?

  • bd5fh2bd5fh2 Member

    Thank you. It does

  • Could you tell me when VariantRecalibrator would use more memory? I have two tests with whole genomes, one with a single library, and the other with multiple lanes with the same library (technical replicates). Both sets are dedupped, realigned, recalibrated, and variants called with UnifiedGenotyper. The files are comparable in size - the single lane of variants is 814M, and the merged lanes of variants is 954M. Here is the command line I'm using:

    java -Xmx18000M -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_random.fa -mode SNP \ -tranche 99.0 -percentBad 0.01 -minNumBad 1000 -input merged_raw.vcf -recalFile snp.recal -tranchesFile snp.tranches \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf \ -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf \ -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -nt 8

    Despite the fact that I'm only giving 18G to the JVM, the amount of memory used on the machine for the single lane of variants hits a maximum of 38G. The merged lanes of variants only use ~19G. They both complete, thank goodness, but I'm very curious as to why the single lane needs so much more memory than the merged set. Lower confidence perhaps?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, I don't think I've seen this before. Is this something you observe consistently or is it a one-off observation?

  • I'm running analysis now to determine whether it's consistent. Unfortunately, it'll be a few days before I have an answer. Do you have any idea why it would balloon like that?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Not really, no. If it's a one-off thing it could be just a server hiccup...

  • williamwwilliamw North CarolinaMember

    I am trying to call variants at a number of specific sites across the genome (splice sites) over about 100 samples. I've run haplotype caller on this data set and produced a set of about 3117 snps at the TS=99.0 level. The tranche plot for the snps recalibration looks pretty similar to the one above ( although the Ti/Tv ratio drops to about .9 in the 99.9 level which greatly increases the amount of false positives there. The tranche plot for after I filter the indels is very stange looking however....The bars actually go towards a negative number of novel variants? Not quite sure how to interpret this. Some help would be appreciated!

  • JiahaoJiahao Member

    Hi,

    I have run the GATK realn-recal bam processing, UnifiedGenotyper and HaplotypeCaller for the tumor-normal paired data from exome capture. Now I have the raw variant vcf results, and am wondering whether should I employ the VariantRecalibrator, which is recommended for large data set (with large number of samples?). If yes, should I merge the vcf files from tumor and normal in advance? or proceed them seperately?

    Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @williamw,

    The problem here is that the tranches plot is not applicable to indels. In the next version we will disable generation of the plots to avoid this confusion.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Jiahao,

    You can only run VariantRecalibrator jointly on samples that were called (run through UG or HC) jointly in the first place. So if you want to recalibrate T-N pairs together, you should go back and call them jointly, if you didn't already do that.

  • vsvintivsvinti Member ✭✭
    edited November 2013

    Question about VQSR plots: I don't have much experience in determining bad plots, but based on my understanding of this page, my attached example looks pretty bad - what do you think? I am trying to figure out how to improve these. My dataset is whole exomes (~ 1,100 samples), called recently with version 2.7-2 (UG) with your latest recommendations.

    Would any of the following contribute to this effect :

    • I ran UG -stand_call_conf and -stand_emit_conf set to 10 (default is 30)
    • I used ts_filter_level 99.9 (in your best practices). I am dealing with a homogeneous population, according to the paragraph below, would it be better to use 99.0 ? "Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope."
    • I ran VQSR using numBad 10000, 20000, 30000, 40000, 50000 and 100000 (out of ~ 1million SNPs). The different values here don't seem to make any drastic improvements.

    Would appreciate any thoughts/suggestions, as I'm a bit lost at this point. I have just watched your videos on this again, haven't made much progress.

  • vsvintivsvinti Member ✭✭

    As a side, is HaplotypeScore worth looking at? The new best practice does not seem to list it in the recommended options for VQSR

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @vsvinti,

    That does look pretty bad. You mention you're working with exomes; did you use a list of capture targets for the pre-processing and calling steps?

  • vsvintivsvinti Member ✭✭

    Yes, I have used the -L option with intervals at each step of the process (including VQSR). I have split those into several files though, and ran UG on each, then concatenated the outputs (the remaining steps were run on concat + sorted vcf). I have just repeated this with ts_filter_level 99.0 and the plots have gotten worse.

    I have done the whole thing previously on ~ half of these samples with gatk 2.5-2, and the plots looked ok then. Any ideas of checks I could do to find the problem would be much appreciated. My data looked pretty good up to this point!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Can you try running the VQSR step using GATK 2.5-2? It would be good to know if the results are significantly different, in case it's an issue with the recent changes to VQSR. If they're not then maybe it's an issue with recent changes to the calling step.

  • vsvintivsvinti Member ✭✭

    Ok. I'll try to find the exact options I used last time for 2.5-2. Will report back soon.

  • vsvintivsvinti Member ✭✭
    edited November 2013

    Alright, here it is. Repeated the VQSR step with gatk 2.5-2, using the recommended options at the time I ran this last.

    Either:

    • The update in the recommended options create such a stark difference in results, or
    • The differences are due to the VQSR version

    They are looking pretty good to me - what do you think? (I don't know about the tranche, have to read a bit more on that). Also, how come it looks like there are so few points in this graph, compared to above (or are they overlaid)?

  • vsvintivsvinti Member ✭✭
    edited November 2013

    Update: I have re-done this a second time with 2.5, but using the recent recommended options. It's looking good! Two more questions (sorry) :

    • Since I am back to using percentBad and numNumBad, I can't find if the recommended options were -percentBad 0.01 -minNumBad 1000 (and whether I need to increase/experiment with minNumBad similarly to the new numBad?)
    • You say that the order of the -an options is important. I can presumably get the standard deviations for these from the recal files (?), and use this to order the options and rerun. Do they need to be sorted in increasing or decreasing order standard deviation?

    Many thanks!

    As a side, let me know if you are finding the same issues with 2.7 and when a patch might be available..

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, that does look like an issue with the released changes. Now, the main VQSR developer has been working hard on improvements to fix a number of issues that have popped up. You might try the latest nightly build and see if the dev version works better on your data. Sorry for the inconvenience, this is a complex piece of software and it takes a lot of experimenting to get it right.

  • vsvintivsvinti Member ✭✭

    That's understandable. I will try the nightly build and let you know how I get on. In the mean time, I look forward to your thoughts on the post above about optimal options for VQSR 2.5.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh right; yes you can try increasing percentBad and minNumBad. Regarding the ordering of the an options, I'm not sure, but I'll ask.

  • vsvintivsvinti Member ✭✭

    Ok, please let me know when you get that info - hopefully soon enough. Appreciate it!

  • vsvintivsvinti Member ✭✭
    edited November 2013

    Sorry to report that the VQSR in the latest patch (GenomeAnalysisTK-nightly-2013-11-14-g27609c3) still performs badly on my data. It seems like I need to finish this data off with VQSR 2.5. I look forward to hearing on the above order of -an options for 2.5. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, I'll let you know when the new improvements/fix becomes available. Regarding the annotations, the dev tells me that ordering the annotations doesn't actually make the results better, but it makes them more stable across multiple runs. And the idea is to order them in order of increasing std deviation.

  • vsvintivsvinti Member ✭✭

    That's great - thank you for clarifying!

  • Hi Geraldine,

    I have a question about the plots created after VQSR.
    In my case, I sometimes observe that although the true variants cluster very well they are overlapped by the variants from the negative training. Also, from the area of well clustered true variants I can see that some of them are filtered.
    Is this acceptable and part of the process or still the VQSR need optimisation? Please see the attached plots.
    I also can see the same in the plots above in this post.
    I use the 2.7-2 GATK.

    Thank you in advance

    Dimitar

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @ddzankoff,

    Your plots look really good! Very nice clustering. And yes it is normal to have some overlap of bad variants in the good variants area -- it just means that they were considered bad in other dimensions than those shown in the plots. Based on this I would say your callset is good to go.

  • Hi Geraldine,

    Thank you for your fast reply.
    It is big relief to know that. I just used the default parameters and followed the guides from this website. You are doing really very helpful job!

    Thank you again

    Dimitar

  • vsvintivsvinti Member ✭✭

    Hi there

    Thanks for the VQSR updates in gatk 2.8. I have run this again with v2.5, 2.7 and 2.8 (this time I have ~ 1,000 more samples). V2.7 still looks equally bad, while 2.5 and 2.8 show similar vqsr plots.

    However, for some plots, I seem differences such as the attached, where MQRankSum in 2.8 draws a lot of red points (filtered) to the left of 0, while this is not so evident in 2.5. Which of these describes a better behaviour?

    For the tranches plot, I get TiTv ratios of 1.7 (v2.5) and 1.94 (v2.8). I am dealing with exomes - is this not too low? I have used the --target_titv 3.2 argument for VariantRecalibrator.

    Would appreciate your thoughts once again.
    Thanks!

    Order of diagrams: plots vqsr 2.5, plots vqsr 2.8, tranches v2.5, tranches v2.8

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @vsvinti,

    I'd have to ask @rpoplin for his expert opinion, but my impression is that 2.8 gives you a tighter clustering so I would tend to favor that version.

    Regarding the TiTv it's hard to say as it can depend on your target list.

    By the way, when you say one set is run with 2.5 and the other with 2.8, do you mean that the variants were called and recalibrated with the 2.5 version then again with the 2.8 version, or did you just redo VQSR on an existing set of calls? We don't recommend mixing and matching tools from different GATK versions in the same workflow, so for best results you should redo the calling each time.

  • vsvintivsvinti Member ✭✭

    Hi Geraldine

    Yes, I had processed the whole workflow with version 2.7-2 (as per previous post). But, because VQSR in that version does not work for me, I did the last step with 2.5 initially, and now also with 2.8. Recalling will take a some time, but I will try it if I get a chance.

    Will wait for poplin to share his ideas on the plots.
    Thanks!

  • vsvintivsvinti Member ✭✭
    edited February 2014

    Reminder @rpoplin : any thoughts on the plots above?

  • rpoplinrpoplin Member ✭✭✭

    Hi there,

    The model that was learned for ReadPosRankSum versus MQRankSum looks like it was basically doing the right thing in both cases. That is, it looks like it learned that a Gaussian centered at zero is good with probability falling off as you move away from zero.

    I wonder, do the other dimensions look reasonable?

    Can you post the log file from the 2.8 run. There is a lot of diagnostic information in there that will help us debug what might be going on with your exomes.

    Cheers,

  • mprasadmprasad FranceMember

    Hi Geraldine,

    I have a small concern about whether VQSR is working well for me with my parameters. I have a single sample from whole exome sequencing for which I am trying to call variants. Based on the Best Practices recommendation I called variants using HaplotypeCaller for my sample along with 30 other 1000G samples- I got around 159,000 variants. I am now trying to use VQSR on the generated VCF file. Everything runs normally and I get the tranche plots and recal plots. I think my recal plots look decent (please fell free to correct me if I am wrong) but I am not sure about my tranche plots (please find plots attached). I am especially concerned by the fact that the known Ti/Tv is low, between .85 and 1.365 depending on the truth sensitivity threshold. I would like to use a TS sensitivity of 99.0 or 99.9 but am concerned by the extremely low Ti/Tv values that correspond to these tranches- am I going to filter out a lot of novel variants? Is it normal to obtain such low values or am I missing something/doing something wrong? Here is the command line I am using to run VQSR.

    gatk -T VariantRecalibrator -R '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta' -input '/media/Seagate Backup Plus Drive/HDS_05/HC.raw.30s.vcf' -resource:hapmap,known=false,training=true,truth=true,prior=15.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/hapmap_3.3.b37.vcf' -resource:omni,known=false,training=true,truth=true,prior=12.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/1000G_omni2.5.b37.vcf' -resource:1000G,known=false,training=true,truth=false,prior=10.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/1000G_phase1.snps.high_confidence.b37.vcf' -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/dbsnp_138.b37.vcf' -an QD -an MQRankSum -an ReadPosRankSum -an FS -mode SNP -titv 3.2 -recalFile '/media/Seagate Backup Plus Drive/HDS_05/snps.30s.recal' -tranchesFile '/media/Seagate Backup Plus Drive/HDS_05/snps.30s.tranches' -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -rscriptFile '/media/Seagate Backup Plus Drive/HDS_05/recalibrate_SNP_plots.R'

    I used the -titv argument to set this at 3.2 as I am using exome sequencing samples. I'd really appreciate your thoughts.
    Thank you for your help,
    Megana

  • vsvintivsvinti Member ✭✭
    edited February 2014

    Hi @rpoplin

    Here are all the vqsr plots for 2.8.

    By log file, do you mean the recal file generated with the -recalFile command?

    Thanks,

    Vicky

    @rpoplin said:
    ..
    I wonder, do the other dimensions look reasonable?

    Can you post the log file from the 2.8 run. There is a lot of diagnostic information in there that will help us debug what might be going on with your exomes.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I believe @rpoplin means the console output, which can optionally be saved to a log file using the --log_to_file option.

  • vsvintivsvinti Member ✭✭

    Log file is too big to attach here - where should I put it for you, @rpoplin ?

  • vsvintivsvinti Member ✭✭

    As an aside, I have been comparing the distribution of the various annotation for the PASS filtered variants (by vqsr) with the recommended hard filters.

    They are mostly ok, but I get a spike of variants with QD < 2 that have the PASS filter. When I compare the Ti/Tv ratios of PASS variants with QD < 2 with those with QD >= 2, there is a big difference. I ended up excluding those with QD < 2, but wondered why VQSR keeps these variants. What are your thoughts?

    TiTvVariantEvaluator  CompRod  EvalRod  JexlExpression  Novelty  nTi     nTv     tiTvRatio  nTiInComp  nTvInComp  TiTvRatioStandard  nTiDerived  nTvDerived  tiTvDerivedRatio
    TiTvVariantEvaluator  dbsnp    QDless2  none            all       14698   24352       0.60   30214029   15253848               1.98           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDless2  none            known      2478    1547       1.60       3069       1330               2.31           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDless2  none            novel     12220   22805       0.54   30210960   15252518               1.98           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDmore2  none            all      520893  202375       2.57   30213889   15253801               1.98           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDmore2  none            known    265420   88415       3.00     263818      88290               2.99           0           0              0.00
    TiTvVariantEvaluator  dbsnp    QDmore2  none            novel    255473  113960       2.24   29950071   15165511               1.97           0           0              0.00
    
  • ChrisPattersonChrisPatterson Epilepsy Genomics CenterMember

    Hello,
    I am brand new to using GATK. We've recently collected WES on affected members and married-in controls of several large families of different ethnic origins. We have 6-8 samples per family. I read on VSQR documentation that it works best on when the sample size is greater than 30, so I was planning to download some of the BAM files from 1000 Genomes as you suggested. What considerations should I make during sample selection to provide the best VQSR results, such as matching samples that use the same Machine, Aligner, Target regions, ethnic background, etc?

    Thanks,
    Chris

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @vsvinti,

    You can upload files to our FTP server (see FAQ here for details).

    I'm not sure when Ryan (@rpoplin) will have the time to answer you, because he is scheduled to travel in a few days and has a lot of high-priority work to finish before then.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @ChrisPatterson You're on the right track -- same target regions, definitely, to maximize overlap; same machine/ seq. tech & aligner help ensure error modes are similar; same ethnic background is preferred. The closest you can get to matching your samples, the better.

  • mprasadmprasad FranceMember

    Hi Geraldine,

    I realize that you are too busy to have a look at everyone's VQSR plot, so I'll abridge my earlier question- pertaining to my prior post I just wanted to make sure that having a tranche plot where the TS thresholds of 90, 99, 99.9 and 100 have corresponding Ti/Tv values much lower than expected for exome sequencing (bet. .85 and 1.36) isn't a cause for concern, especially if the recal plots look good. If this is unusual, I'd appreciate some suggestions on where to begin tweaking the variant calling/recalibration steps. This is my first time using VQSR so I'd like to make sure all is well before proceeding.

    Thank you for your help,
    megana

  • vsvintivsvinti Member ✭✭

    @rpoplin , I have uploaded vsvinti_test5VQSR.logfile.log to the ftp server. I will wait a couple of weeks for response. Thanks.

Sign In or Register to comment.