Questions about VQSR tutorial

Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
This discussion was created from comments split from: (howto) Recalibrate variant quality scores = run VQSR.

Comments

  • avidLearner (Member)

    Hello, I have a few questions that I would like to clarify:

    1. Is the file 1000G_phase1.snps.high_confidence.vcf available in the resource bundle? (I looked it up and it wasn't there!) If not, which file is recommended with the updated VariantRecalibrator parameters?

    2. Just to clarify, are the parameters described above pertinent to whole genomes or whole exomes? (I am assuming whole genomes, but this is not mentioned explicitly anywhere.) I understand that it is recommended to use --maxGaussians 4 --percentBad 0.05 for a small number of exome samples. In that case, is a -minNumBad of 1000 recommended, or should a lower value be set?

    3. The VariantRecalibrator documentation parameters (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html) are slightly different from the parameters described here. Is this tutorial the latest best practices guideline that can be followed for VQSR?

    Thanks for your help!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi there,

    • The filenames are a little different from what we have in the bundle; I will correct that soon (sorry). Basically, you want the 1000 Genomes Phase 1 SNPs file, which should be in our bundle.

    • These parameters are for WGS. I will add a note to explain that.

    • The parameters given in tutorial documents are not necessarily up to date with our Best Practices recommendations. Those can be found in the Best Practices document on our website, which collects (or points to) the actual parameter recommendations. For VQSR there is a specific FAQ document, linked from the Best Practices document. We will try to make that clearer in the documentation.

  • avidLearner (Member)

    Hello Geraldine, I searched for the 1000 Genomes Phase 1 SNPs VCF in ftp://ftp.broadinstitute.org/bundle/2.3/b37/ but no luck! Please tell me if I'm looking in the wrong place.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hey @avidLearner, sorry to get back to you so late. You've been looking in version 2.3 of the bundle, but you should look in the latest version, which is 2.5.

  • avidLearner (Member)

    Hi Geraldine, yes, I figured it out. I was initially accessing ftp://ftp.broadinstitute.org/bundle/ through Internet Explorer, and for some reason it wasn't showing the updated version 2.5. But when I accessed the resource bundle with wget ftp://[email protected]/bundle/2.5/ I was able to locate the latest version. Sorry for the trouble, and thanks for the help!

  • yingzhang (Minneapolis; Member)
    edited August 2013

    This is a really useful article on variant recalibration, and I think it makes more sense than the brief overview at http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

    But I do have a question, which may just be me overthinking.

    In the overview, you state:

    =======

    We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:

    snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
    indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
    recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
    analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
    

    ========

    The difference between the overview and this FAQ is how one should build the error model for indels.

    In the overview, the indel model is built from the raw variants VCF, while in this article it is built from the recalibrated_snps_raw_indels.vcf.

    I assume there should be no cross-talk between the INDEL and SNP types; but if there is, the indel model built from the raw VCF will differ from the one built from recalibrated_snps_raw_indels.vcf.

    Should I follow the overview or this FAQ?

    And when you say "best results", what else have you tested?

    Best,
    Ying

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi Ying,

    We're currently rewriting the Best Practices document to make it more readable, so hopefully that will help clarify things. In the meantime, let me try to clear up some of the confusion, which is very common, about how the recalibration works.

    When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool outputs recalibrated variants for that type and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round.

    This replaces the old way of proceeding, where we recalibrated indels and SNPs in separate files, because it is faster and more efficient.

    Does that make more sense?
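The two passes described above map onto the tools like this (a sketch only, assuming GATK 2.x-style command lines; ref.fasta, raw_variants.vcf, the tranche level, and the abbreviated resource/annotation lines are placeholders to be filled in from the tutorial):

```shell
# Pass 1: build a SNP model and apply it; indels pass through untouched
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R ref.fasta -input raw_variants.vcf -mode SNP \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    -recalFile snps.recal -tranchesFile snps.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R ref.fasta -input raw_variants.vcf -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile snps.recal -tranchesFile snps.tranches \
    -o recalibrated_snps_raw_indels.vcf

# Pass 2: repeat in INDEL mode on that output; the recalibrated SNPs
# from pass 1 are emitted as-is
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R ref.fasta -input recalibrated_snps_raw_indels.vcf -mode INDEL \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf \
    -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    -recalFile indels.recal -tranchesFile indels.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R ref.fasta -input recalibrated_snps_raw_indels.vcf -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile indels.recal -tranchesFile indels.tranches \
    -o analysis_ready.vcf
```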

  • yingzhang (Minneapolis; Member)

    Thank you Geraldine. Your explanation is really clear.


  • Hi @Geraldine_VdAuwera,

    Does the two-step process quoted below also work for version 1.6?
    (I am using v1.6-22-g3ec78bd)

    "When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool output recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round."

    Or is 1.6 still using the separate process?

    "old way of proceeding, where we recalibrated indels and SNPs in separate files"

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    With 1.6 you should recalibrate variants separately (the old way).

    FYI version 2.7 has some key improvements to VQSR that make it work better and more consistently. We really don't recommend using older versions.

  • avilella (Member)
    edited September 2013

    Can I recalibrate each chromosome as a separate VCF, each time pointing to the whole resource files? Will it make a big difference compared to merging the chromosome VCFs and recalibrating only once?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    It's much better to recalibrate all the chromosomes together because it gives the VQSR model more data, which will make it more accurate and reliable. In many cases individual chromosomes don't even have enough variants for recalibration to work at all.

  • prepagam (Member)

    Can I check: should VQSR be done on each population separately, as with UnifiedGenotyper? I have 40 samples, but from many populations, so there is probably not enough data for VQSR if it's done per population.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @prepagam,

    You should run VQSR jointly only on samples that were called jointly; so if you called your samples in subsets by population, you also have to recalibrate them according to the same subsets.

  • splaisan (Leuven, Belgium; Member ✭✭)

    Further in the workflow, I noticed the now-fixed --maxGaussians 4 (not yet fixed in the text part), and that -tranche followed by a range notation was not accepted on my machine. Is this normal?

    # fails even with single or double quotes around []
        -tranche [100.0, 99.9, 99.0, 90.0] \ 
    # works with
        -tranche 100.0 \
        -tranche 99.9 \
        -tranche 99.0 \
        -tranche 90.0 \
    

    thanks

    Stephane

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @splaisan,

    You are correct, the way you got it to work is the right way to do it. I'll rephrase it in the doc since it seems to be confusing people. Thanks for pointing this out!

  • splaisan (Leuven, Belgium; Member ✭✭)

    sad, the bracket notation was cool!
    ;-)

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Yep. Maybe if we find time we'll amend how the arguments can be passed in... (but don't hold your breath)

  • splaisan (Leuven, Belgium; Member ✭✭)

    Inspired by the BroadE 2013 video #5 by Ryan Poplin (maybe a question for Ryan!); great video BTW.
    Context: I took Illumina NA18507 chr21 mappings (originally to hg18) where most of the BAM header is missing and complies poorly with Picard+GATK requirements. I first extracted the reads to FASTQ to remap them to hg19, and then followed the full Best Practices pipeline.

    Q: With such data, should I add one @RG corresponding to each original lane present in the hg18 BAM file (therefore exporting each lane to a separate FASTQ), in order to take advantage of the lane bias detection of the recalibration? Or will the effect be minimal and not justify the long run?

    THX

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Yes, you should definitely assign RGs per original lane for best recalibration results. It is totally worth the extra effort, because different lanes are likely to suffer from different error modes.

    Keep in mind also that when reverting to FASTQ, it's important to randomize the reads in order to avoid biasing the aligner. I forget if you're the person I recently discussed this with on this forum; if not, I recommend you check out our tutorial on reverting BAMs to FASTQ.
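One simple way to do the randomization at the command line (a sketch, assuming an uncompressed single-end FASTQ; reads.fq is a placeholder, and the Picard-based route in the tutorial is the supported one):

```shell
# make a tiny example FASTQ (two 4-line records)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nJJJJ\n' > reads.fq
# paste glues each 4-line record onto one tab-separated line,
# shuf permutes whole records, tr restores the 4-line layout
paste - - - - < reads.fq | shuf | tr '\t' '\n' > reads.shuffled.fq
```

For paired-end data the mates would need to be interleaved first so pairs stay together through the shuffle.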

  • splaisan (Leuven, Belgium; Member ✭✭)

    Hi again,
    About -an parameters.

    When the HC has been used without specifying annotation types (as is the case in the separate tutorial), do we always get the -an types described above? Is this the full list, or just the Best Practices defaults?
    Is there a rationale for which annotations to use for whole-genome sequencing? And what about the mirror question for the UG?
    Thanks

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    There is a set of core annotations that are used by default for both UG and HC. They typically don't change across versions. They are the "Standard" annotations; you can get a list by running VariantAnnotator with the -list argument, or if you want to check if a specific annotation is in the standard set, you can look it up on its tech doc page, e.g. here toward the top of the page, where it says "Type:". In the current defaults, the only difference between UG and HC is that HC additionally uses ClippingRankSumTest and DepthPerSampleHC.

    The choice of the default set (and any other recommendation we may make in the documentation) is based largely on empirical observations of what annotations are consistently informative and useful for filtering in our work. The caveat is that most of our work is done on specific data types (e.g. well-behaved Illumina exomes) so if you're using different data types you may need to experiment with other annotations. But we try to make our recommendations as broadly applicable as possible.
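For reference, the listing mentioned above can be obtained with something like this (a sketch; the exact argument spelling may vary between GATK versions):

```shell
# print the annotations available to VariantAnnotator
java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R ref.fasta --list
```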

  • vsvinti (Member ✭✭)

    Hello there, Wondering what would be a sensible option for -numBad. You say "The default value is 1000, which is geared for smaller datasets. If you have a dataset that's on the large side, you may need to increase this value considerably."

    I have ~1,100 samples, and I set -numBad to 10,000. Would that be reasonable? I am working with whole-exome data. Your previous recommendations had "-percentBad 0.01 -minNumBad 1000". Does the new -numBad replace both of these? Cheers!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Yes, the new -numBad argument replaces the other two.

    The important metric for setting -numBad is not really the number of samples, but rather the number of variants initially called.

  • vsvinti (Member ✭✭)

    Right, of course. So for ~ 900,000 variants, would 10,000 suffice for numBad (or how does one determine the 'ideal' number)?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    It's hard to say up front because it depends a lot on your data, but I think 10K out of 900K is a good place to start. I would recommend doing several runs with increasing numbers and choosing the one that gives the best-looking plots. The good news is we are working on a way to have the recalibrator calculate the optimal number automatically, to take the guesswork out of this process...
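The several-runs approach can be scripted as a simple sweep (a sketch for GATK versions that still accept --numBadVariants; RESOURCES is a placeholder for the same -resource/-an arguments shown in the tutorial):

```shell
# placeholder: the full -resource/-an argument list from the tutorial
RESOURCES='-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf -an QD -an FS'

# sweep -numBad and compare the resulting tranche plots per run
for nb in 1000 5000 10000 20000; do
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R ref.fasta -input raw_variants.vcf -mode SNP \
        $RESOURCES \
        --numBadVariants $nb \
        -recalFile snps.numBad${nb}.recal \
        -tranchesFile snps.numBad${nb}.tranches \
        -rscriptFile snps.numBad${nb}.plots.R
done
```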

  • vsvinti (Member ✭✭)

    That's great, thanks!

  • vsvinti (Member ✭✭)

    Hi there, another question on VQSR training sets / options. This page suggests the following:
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    -mode SNP -numBad 1000 \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0

    While the section on "What VQSR training sets / arguments should I use for my specific project?" says
    --numBadVariants 1000 \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \

    Both of these pages are on the current best practices page. For omni, the first suggests truth=false, while the latter says truth=true. I assume one of them is wrong - please advise which is correct. Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @vsvinti,

    Thanks for pointing that out; I understand it is confusing. As a general rule, tutorials are only meant to provide usage examples and do not necessarily reflect the latest Best Practices parameter values. In this case, though, it is actually a mistake in the command line: if you look at the parameter description earlier in the text, Omni is correctly identified as a truth set. I'll fix this in the text.

  • vsvinti (Member ✭✭)
    edited November 2013

    Thanks for that. I have a general tendency to treat anything I find in the Best Practices as the recommendations to go by, and since the tutorial link came first, I thought it had the newer updates. I wonder if there's a way to point out in the Best Practices which pages contain the latest option recommendations and which don't; I generally don't read all the subsections if I think I've found the answer. Thanks.

  • vsvinti (Member ✭✭)

    How about for indels? The tutorial only has Mills (all set to true), while the methods article has:
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    I wonder if there's a way to point out in best practices which pages contain the latest option recommendations, and which don't

    I'll try to think of something to make this clearer.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    The methods article always takes precedence over the tutorial. For the record, the difference it makes is very minor: if you use the dbsnp file with known=true, that's what the recalibrator will use to annotate variants as known or novel; otherwise it will use the Mills set. But this does not play any role in the recalibration model calculations.

  • rrahman (Member)

    Should we add the -numBad parameter to the indel recalibration model as well for whole-exome datasets, or anything else? Thanks in advance.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @rrahman,

    The -numBad parameter has been removed from VQSR in the latest version (2.8), so unless you're using an older version, you should not try to use this parameter.

  • rrahman (Member)

    so, for the 2.8 version, shall I use the exact recalibration model for both SNPs and INDELs for the whole exome datasets?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    I'm not sure what you mean by "exact recalibration model". Basically, follow the instructions as written in the article above.

  • rrahman (Member)

    I meant the same parameters as shown in the section above. I was a bit confused because in the earlier comments you said they were for whole-genome datasets. But anyway, thanks for the clarification.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Oh, I see what you mean. Yes, it's essentially the same thing, although for exomes you should specify a list of intervals (which you should already be doing for the previous steps as well). And for the Ti/Tv-based tranche plots to look good, you may need to adjust the expected Ti/Tv ratio parameter.

  • rrahman (Member)

    Yes, I already took care of the intervals in the previous steps, and thanks again for pointing out the expected Ti/Tv ratio parameter.

    Cheers
    Rubayte

  • On the FTP site for the resource bundle, I see dbsnp_138.b37.excluding_sites_after_129.vcf.gz and dbsnp_138.b37.vcf.gz. Which one should I use for -resource:dbsnp? Are there some cases where excluding_sites_after_129 is appropriate, and other cases for the full 138?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @SerenaRhie,

    You can safely use either one as the resource:dbsnp for recalibration, as it will not affect the filtering of your variants. The difference is that with the full 138 version, more of your variants are likely to be annotated as "known" (as opposed to "novel"), so if you are doing any meta-analysis that compares numbers of known vs. novel variants, you'll need to keep this effect in mind.
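To see the effect described here, you can tally known vs. novel calls from the ID column of your callset, where the rsIDs come from whichever file was supplied as the known set (a sketch with an inline toy VCF; calls.vcf is a placeholder):

```shell
# toy VCF: '.' in column 3 means no dbSNP ID was assigned (novel)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n1\t100\trs123\tA\tG\n1\t200\t.\tC\tT\n1\t300\trs456\tG\tA\n' > calls.vcf
grep -v '^#' calls.vcf | awk -F'\t' '$3 == "." { novel++; next } { known++ }
    END { print known " known, " novel " novel" }'
# prints: 2 known, 1 novel
```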

  • @Geraldine_VdAuwera said:
    it's essentially the same thing, although for exomes you should specify a list of intervals (which you should already be doing for previous steps as well). And for the ti/tv-based tranche plots to look good, you may need to adjust the expected Ti/Tv ratio parameter.

    I haven't come across this recommendation for working with exomes before (that I should be specifying a list of intervals). Can you please point me to the documentation I should read to better understand? Thank you!
    Sarah

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @sfbarclay,

    We don't have documentation specifically on that (though perhaps we should -- I'll put it on the todo list) but here is a brief explanation.

    When you produce exome sequence data, by design the data only covers specific target sequences, for the most part. But there is also some off-target sequence data, which is typically not well mapped and mostly produces false-positive calls. So if you don't restrict your analysis to the regions targeted by the sequence prep, you enrich your callset for false positives. That can mess up filtering and callset evaluation, because QC metrics like the Ti/Tv ratio will be skewed by the false positives.

    Make sense?

  • Hi Geraldine,
    Yes, that makes perfect sense, thank you. My variant lists were a lot bigger than I thought they should be, and I'm guessing this is why. So in HaplotypeCaller I should use the --activeRegionIn argument and specify a BED file corresponding to my target regions. Is that right? And there wouldn't be anything additional to specify in the VariantRecalibration step? Thanks!
    Sarah

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @sfbarclay, not quite. The argument for passing intervals is -L. It's an engine argument so you'll find it in the CommandLineGATK tool documentation. You should also look up the interval padding argument which is also recommended for best results.

    The --activeRegionIn argument is a specialized argument with a very different purpose (mostly troubleshooting).

    You don't have to use -L in subsequent steps but in some cases it can speed up processing to do so, so it doesn't hurt to keep using it.
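Concretely, the intervals arguments might look like this (a sketch; targets.interval_list, the padding value, and the file names are placeholders, and -ip is the short form of the interval padding argument mentioned above):

```shell
# restrict calling to the capture targets, padded by 100 bp on each side
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fasta -I sample.bam \
    -L targets.interval_list \
    -ip 100 \
    -o sample.vcf
```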

  • @Geraldine_VdAuwera, perfect. Thank you very much!

  • ChrisPatterson (Epilepsy Genomics Center; Member)

    I'm trying to run VariantRecalibrator on my indels. I am using Mills_and_1000G_gold_standard.indels.hg19.vcf as my resource (downloaded from your resource bundle), as suggested in the guide. However, the INFO field contains no annotations, so the program isn't able to create the recalibration model. And there are no samples within the VCF, just references to the original dataset they were taken from, so VariantAnnotator can't fill in the missing information. Have I misunderstood which resource files I should use for this tool?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @ChrisPatterson VariantRecalibrator doesn't use sample information, so those resource files are fine to use. Can you check that you got the right version of the file (e.g. hg19 vs. b37), to match the reference build that was used to map your data? Currently the tool does not check whether the reference versions match; if you are using different builds, the contig names are different, so the tool will complain that it cannot find any overlapping variants with the right annotations. But that's not actually the problem: the reference mismatch is. We're going to fix that in the next version, of course.

  • ChrisPatterson (Epilepsy Genomics Center; Member)

    I had seen your response to an earlier question where they were having the same problem, so checking the build version was one of my first steps.

    I figured out where I made my mistake. I was initially using HaplotypeCaller, which calls SNPs and INDELs together, but on our system the processing time for a whole exome was going to be weeks, so I switched over to UnifiedGenotyper, which calls SNPs and INDELs separately. So the VCF file I was trying to recalibrate for INDELs didn't actually have any INDELs in it.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Ah, that makes sense. Glad you were able to track down the issue.

  • caligiuri (Member)

    Hi everyone!
    I just started an exome sequencing project, and this will surely be a very easy problem to solve, but I'm stuck at the VQSR step. At the moment I'm working with just one sample, to test the pipeline I set up, but I understand it is not possible to run VQSR with few samples. For this reason I'm trying to download VCF files from 1000 Genomes and add them to mine. I have a few questions though: first of all, the sequences were obtained with a different platform than mine (Illumina GA IIx), and I couldn't find any information about how the variants were called (I used HaplotypeCaller). Would it be safe to run my sample together with these? And second, in the 1000 Genomes FTP folder there are several VCFs, more than one per chromosome and also one for the autosomes. Should I merge them all together? Or would you suggest some other source where I can find a single file already merged (or BAMs, in case it's better to start from those)?
    Thanks in advance for the help you can give me!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @caligiuri As explained elsewhere on this site, you should start from the bams, not the VCFs. Platform doesn't matter much. Better to choose the samples based on ethnicity (to match yours). Be sure to look up our new workflow where variants are called per-sample followed by joint genotyping.

  • caligiuri (Member)
    edited August 2014

    Thanks a lot for your fast reply, Geraldine! I'll do as the new workflow suggests... I just have one more question about it: should I run HaplotypeCaller on each BAM separately? Will it take a long time to do them one at a time? Again, thanks!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @caligiuri Yes, you run HC on each BAM separately (with -ERC GVCF and a few additional arguments; see the docs). It takes time, sure, but less than trying to run them all together (because HC's runtime increases exponentially as you add samples to a multisample analysis). And the good thing is that if you have a cluster (or several machines available), you can run the samples in parallel.
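A per-sample loop along those lines might look like this (a sketch for GATK 3.x-era syntax; ref.fasta and the index arguments are placeholders and assumptions, so check the docs for your version):

```shell
# call each sample separately in GVCF mode, running jobs in parallel,
# then joint-genotype all the per-sample GVCFs together
for bam in *.bam; do
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R ref.fasta -I "$bam" \
        --emitRefConfidence GVCF \
        -variant_index_type LINEAR -variant_index_parameter 128000 \
        -o "${bam%.bam}.g.vcf" &
done
wait
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R ref.fasta $(printf -- '-V %s ' *.g.vcf) -o joint.vcf
```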

  • miaoxu (Member)

    Hi @Geraldine_VdAuwera, I'm now working on a targeted sequencing dataset with a target size of 160 kb, a sample size of 250, and an average depth of about 1000X. That is to say, we got about 40G of clean data for calling variants after mapping. In my initial call with UG and hard filtering, we found about 7,000-8,000 variants in total and 1,000-1,500 variants per sample. There is no dbSNP dataset available for this quite new genome. I'm planning to build my own reference VCF through several cycles of BQSR and HC, and finally use this final reference SNP set as the true sites for VQSR. I see that in the Best Practices you recommend a minimum sample size of 30 for exome capture experiments with VQSR, which would correspond to a 180G dataset at 100X average depth. I'm wondering whether I can apply VQSR to my dataset; it is quite small. Any suggestions would be highly appreciated.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @miaoxu,

    I think your experiment is too small for VQSR to work properly. You can try, and it may be possible to get VQSR to run on your data, but I suspect the results will not be better than with manual filtering. We are working on a new implementation of VQSR that will work better on small datasets, but it is not ready yet. Sorry for the bad news, and good luck!

  • miaoxu (Member)

    hi @Geraldine_VdAuwera,

    Thank you so much for your prompt answer. I'm glad to hear that you're developing a new implementation of VQSR that works better on small datasets. I look forward to the new edition.

    For now, is the best approach several cycles of BQSR + HC + hard filtering, until the called variants and BQSR plots converge on my dataset?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Yes that's right.
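The bootstrapping cycle confirmed here can be sketched as follows (a sketch only; the hard-filter expression, the number of rounds, the file names, and the convergence check are all placeholders to adapt to your data):

```shell
# start with no known sites; after each round, the hard-filtered calls
# become the known sites for the next round of BQSR
cp input.bam current.bam
for round in 1 2 3; do
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R ref.fasta -I current.bam -o calls.round${round}.vcf
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R ref.fasta -V calls.round${round}.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0" --filterName "hard_filter" \
        -o filtered.round${round}.vcf
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R ref.fasta -I current.bam \
        -knownSites filtered.round${round}.vcf \
        -o recal.round${round}.table
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R ref.fasta -I current.bam \
        -BQSR recal.round${round}.table \
        -o recalibrated.round${round}.bam
    cp recalibrated.round${round}.bam current.bam
    # compare the recalibration plots and callsets between rounds,
    # and stop iterating once they stop changing
done
```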
