The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.

Best recommendation for base recalibration on non-human data

sagipolanisagipolani Posts: 41Member
edited October 2012 in Ask the GATK team

Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping via BWA, soritng and merging, my pipeline goes like this:

  1. Remove duplicates via MarkDuplicates.
  2. RealignerTargetCreator
  3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
  4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input vcf file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!

Sagi

Post edited by Geraldine_VdAuwera on

Best Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    edited October 2012 Answer ✓

    Hi Sagi,

    This is covered in the documentation for the BaseRecalibrator FAQs here:

    http://www.broadinstitute.org/gatk/guide/article?id=44

    I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

    The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

    However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    Answer ✓

    Number of rounds will depend on your dataset. More is better but at some point you begin to see convergence, and improvements are incrementally lower.

    Base recalibration performs an important function for the quality of your callset. Doing the iterative process described above is a lot of work and is not 'perfect', but it is better than skipping recalibration altogether.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    Answer ✓

    Hi Sagi,

    Base recalibration and variant recalibration are completely different processes that deal with completely different problems, at different steps of the variant calling workflow (hint: one is done on bases, the other is done on variants...).

    This is covered in detail in our documentation; a good place to start reading is the Best Practices article:

    http://www.broadinstitute.org/gatk/guide/article?id=1186

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    Answer ✓

    I would recommend manually creating a filtered subset of high-quality calls rather than relying on the LowQual annotation from the genotyper. See the variant filtering and selection tools for details on how to do this.

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    edited October 2012 Answer ✓

    Hi Sagi,

    This is covered in the documentation for the BaseRecalibrator FAQs here:

    http://www.broadinstitute.org/gatk/guide/article?id=44

    I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

    The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

    However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • sagipolanisagipolani Posts: 41Member

    Hi there!

    Thank you very much for this answer.

    In light of this, the question remains as to how many rounds of non-recalibrated-Variant-calling and recalibrated-variant-calling should be done in order to get trustworthy results?

    Maybe it's best to skip this recalibration process in the case of unavailable / trustable databases?

    Another question relates to comparison between individuals. For example, if I may have a vcf file pertaining to variants that were called in one individual and use it as an input vcf file for recalibration, am I limiting myself in this case to variants pertaining to the ones used in the input vcf files? If so, could I be giving up on identifying novel variants in the new individual?

    Thanks!

    Sagi

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    Answer ✓

    Number of rounds will depend on your dataset. More is better but at some point you begin to see convergence, and improvements are incrementally lower.

    Base recalibration performs an important function for the quality of your callset. Doing the iterative process described above is a lot of work and is not 'perfect', but it is better than skipping recalibration altogether.

    Geraldine Van der Auwera, PhD

  • sagipolanisagipolani Posts: 41Member

    OK got it: call variants on non-recalibrated bam file, and then use the output vcf file as an input for the BaseRecalibrator. Great!
    Now I'm wondering what is the difference between the BaseRecalibrator and the VariantRecalibrator, and which is better to use for non-human genomes?

    Thanks!

    Sagi

  • sagipolanisagipolani Posts: 41Member

    Another question please:
    Following your recommendations, when using the UnifiedGenotyper to call the variants prior to recalibration, the output vcf file marks the low-quality calls as "LowQual". In order to use the high-quality calls only as an input for the BaseRecalibrator tool, do I need to remove these "LowQual" calls manually prior to using this vcf file as an input file for BaseRecalibrator, or will BaseRecalibrator identify these "LowQual" variants automatically and ignore them?

    Thank you very much!

    Sagi

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    Answer ✓

    Hi Sagi,

    Base recalibration and variant recalibration are completely different processes that deal with completely different problems, at different steps of the variant calling workflow (hint: one is done on bases, the other is done on variants...).

    This is covered in detail in our documentation; a good place to start reading is the Best Practices article:

    http://www.broadinstitute.org/gatk/guide/article?id=1186

    Geraldine Van der Auwera, PhD

  • sagipolanisagipolani Posts: 41Member

    Hi Geraldine,

    Thanks again for your help - will do!

    I would kindly appreciate your input also on the first question I had:
    Following your recommendations, when using the UnifiedGenotyper to call the variants prior to recalibration, the output vcf file marks the low-quality calls as "LowQual". In order to use the high-quality calls only as an input for the BaseRecalibrator tool, do I need to remove these "LowQual" calls manually prior to using this vcf file as an input file for BaseRecalibrator, or will BaseRecalibrator identify these "LowQual" variants automatically and ignore them?

    Thanks again!

    Sagi

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin
    Answer ✓

    I would recommend manually creating a filtered subset of high-quality calls rather than relying on the LowQual annotation from the genotyper. See the variant filtering and selection tools for details on how to do this.

    Geraldine Van der Auwera, PhD

  • kanghmkanghm Posts: 18Member

    Hi all!
    I'm working on non-human data. I'm wondering what do you mean with "converge", and how can I know they converged.
    Thank you!
    kanghm

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Convergence is when for successive runs of the recalibration process, you end up with roughly the same results. In other words, in the first few runs, you should see big differences each time, then the differences will get smaller and smaller each time.

    Geraldine Van der Auwera, PhD

  • kanghmkanghm Posts: 18Member

    OK, thanks very much. I wonder if the same results means the same reported score after recalibration. Do I need write my own script to confirm they converged, or I can get this information elsewhere?
    And another question: every time I run base quality recalibration (in the loop), I should use the unrecalibrated bam file, not the bam file after recalibration, am I right?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    If you generate the before/after plots described in the documentation, you should be able to use that to assess convergence. That's right, you should use the unrecalibrated bam file.

    Geraldine Van der Auwera, PhD

  • kanghmkanghm Posts: 18Member

    “before/after plots” means "PDF file containing quality control plots showing the patterns of recalibration of the data", right? Because now I have some problems to generate this file. Can I see convergence with "The Recalibration Report"? If so, could you specify where? thank you very much!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    You should really figure out how to get those plots generated, because they are very important for quality control of the recalibration process. Without them you are working blindly. What is the problem you're having with them?

    Otherwise I guess you could compare the successive recalibration reports, sure. You would need to load the grp files into R using gsalib, and write some kind of script that plots a comparison. But it will be a little complicated; you'll need to be comfortable working with R.

    Geraldine Van der Auwera, PhD

  • kanghmkanghm Posts: 18Member

    Now i'm trying to install gatk-protected-master.zip, it seems to be a must to plot.(Am I right?)
    When I install with "ant gsalib",it ouputed
    "Buildfile: /home/kanghm/gatk-protected-master/build.xml
    [taskdef] Could not load definitions from resource cloverlib.xml. It could not be found".
    If this is not a problem influence my next process,I would proceed(Because I need to upgrade my R to install gplots package).
    And I'm also refer to http://gatkforums.broadinstitute.org/discussion/1297/no-plots-generated-by-the-baserecalibrator-walker
    to get my problem solved. I pray it will help. I'm so gratitude for your patient help.
    Thank you again, you are so kind!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    It is much better to make the plots, yes.

    Someone else had the same error recently and we resolved it, have a look at this discussion: http://gatkforums.broadinstitute.org/discussion/2467/install-gsalib

    If you're going to upgrade R, make sure to do it before installing the libraries just in case the upgrade doesn't conserve them.

    Geraldine Van der Auwera, PhD

  • kanghmkanghm Posts: 18Member

    OK, got it! Thanks a lot!

  • kanghmkanghm Posts: 18Member
    edited April 2013

    Hello, @Geraldine_VdAuwera:
    I'm back again! With your help, I have successfully install gsalib, and it can work now!

    However, I come to another question: why did you suggest we do our recalibration with the SNPs having highest confidence (with no-human data)? I'm doing my analysis with pig, and it has a relatively small SNP database(has about 400 thousand records) on NCBI. I've been recalibrated with this database. But when I called SNPs with this recalibrated bam file,I got a low power to detect. At the same time, when I called SNPs with unrecalibrated bam file, the power is high. So I think I should not recalibrated with this small database(because it is too small). Am I right? And you recommend we do recalibration with SNPs having highest confidence, does it means it is also a small SNP database? Then its results will equal to that of the database from NCBI. Or should I use a database larger (with the quality not the higest)? Looking forward to your anwser!

    PS: Why I think the databse on NCBI is small, a reason is the number of SNPs I called is several million(high quality, pass my filter criteria),even with my recalibrated bam file (used NCBI database).

    And the reason why I want to do reclibration is the unrecalibrated's results has a relatively high false positive rate.

    Thank you very much!
    kanghm

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Hi @kanghm,

    This is a difficult question because there is no fixed rule for the size of your database relative to the number of variant calls you get with your samples. It depends on a lot of things, including how much variation there is between individuals in your organism of study.

    You will have to decide for yourself based on the principle of how base recalibration works, which is this:

    When BaseRecalibrator builds the recalibration model, it assumes that every mismatch is a sequencing error EXCEPT if that position is in the database of known variants. So, the database works like a mask that "protects" SNPs from being interpreted as errors.

    The trade-off is this: if your database is very large and has a lot of junk calls (SNPs that are not real), you will over-mask and the BaseRecalibrator will underestimate the amount of errors in your data. Down the road that usually leads to more false positives because the scores of many mismatching base calls are higher than they should be. Conversely, if your database is very small (and presumably contains only very high-confidence calls), you will under-mask, and BaseRecalibrator will overestimate the amount of errors in your data. Later that usually leads to more false negatives because the scores of base calls at real SNPs are lower than they should be.

    Does that help clarify things?

    Geraldine Van der Auwera, PhD

  • ymwymw Posts: 9Member

    I have a question regarding this thread of discussion. That is choosing highly confident SNP for recalibration.
    It was recommded to manually filter the SNP from the VCF produced by UnifiedGenotype.
    Can we just simply adjust the "-stand_call_conf" and "-stand_emit_conf" option in UnifiedGenotype to product highly confident SNP?
    By doing so, it can save time from extra filtering.
    Will 50 be high enough for "-stand_call_conf" and "-stand_emit_conf" to produce good SNP calls?

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Hi @ymw, we prefer to use UG with low threshold to produce a highly sensitive, inclusive callset, then filtering for specificity. Part of the reason is that while call confidence is a useful metric, sometimes you'll need to look at other annotated metrics to distinguish real variants from artifacts. And if you want to change your filtering / confidence parameters, it's faster to run filtering again rather than redo the calling itself.

    Geraldine Van der Auwera, PhD

  • prepagamprepagam Posts: 57Member

    I'm working on a non-human taxa - the aim to look at SNP variants for down stream analysis using statistics based on the site frequency spectrum. I'm in a situation, where I have a database of some known SNPs from population A of my species, but I'm trying to call SNPs in both population A and population B of my species. My concern is that using the database of known SNPs is going to create a bias (more detection/sensitivity for population A vs B - which will skew the site frequency spectrum). This question is really for use of databases for base quality score recalibration and variant score recalibration.

    I wondered your thoughts? If all the human databases were derived only for example from Europeans, would you be hesitant to use them in a pipeline to call SNPs in a group of African samples - if you then wanted to compare variant statistics between Europeans and Africans (although I'll add my species does show more population structure than humans!). Obviously the alternative is to pretend I don't have a database and do as you describe above (do a first round of snp calling, then use best of these as a database .....). I guess I'm unsure if this is better than using the database?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Hi @prepagam,

    The simplistic answer is I wouldn't worry about it. The more nuanced answer is that it would depend on how distinct your populations are from each other. In humans, we wouldn't worry about using known sites from e.g. European samples to recalibrate African samples, because there's not enough difference to impact the recalibration processes. For VQSR especially, only the sites overlapping both your known set and your test sets will be used, so really you'll be using only common sites. For BQSR it is theoretically possible that you might introduce some bias, but it would only matter if the sequencing error modes specifically affected one population's variants, which is extremely unlikely.

    Of course if you want to be super-safe, you could bootstrap your own DB of variants and use it alone, then in combination with your existing DB of sites, and compare results.

    Geraldine Van der Auwera, PhD

  • prepagamprepagam Posts: 57Member

    As per recommendations for non-human data, I've been running BQSR for multiple rounds to achieve convergence (which I define as the base qualities don't change any more and look recalibrated (i.e. on the diagonal line). I am doing this BQSR on each sample separately, and notice that different samples require different numbers of rounds of recalibration to get convergence (between 2-4) - this is consistent with the fact my data is from many sources. In order to automate my pipeline I'd like to have a specific number of rounds of BQSR I do on all samples i.e. 4 which is always enough. As such I wonder whether this might be problematic for the samples that really only needed 2 rounds of BQSR to converge? Or do I need to check each one manually after each round? And then do additional rounds as needed.

  • SheilaSheila Broad InstitutePosts: 3,748Member, Broadie, Moderator, Dev admin

    @prepagam‌

    Hello,

    This is not a problem. It is absolutely safe to do, as long as you don't mind spending the extra compute up front (as opposed to only doing >2 rounds when it looks necessary).

    -Sheila

  • tylor11tylor11 chinaPosts: 3Member

    @Geraldine_VdAuwera said:
    I would recommend manually creating a filtered subset of high-quality calls rather than relying on the LowQual annotation from the genotyper. See the variant filtering and selection tools for details on how to do this.

    i have follow your advise on knownsites.but some questions confued me. After rounds of non-recalibrated-Variant-calling and what BQSR result is right? Can i use the result to run VqSR as resourses ?

    -taylor

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    @tylor11 Your questions confuse me... Can you please clarify what you are asking?

    Geraldine Van der Auwera, PhD

  • tylor11tylor11 chinaPosts: 3Member

    My question is that how many rounds of non-recalibrated-Variant-calling and recalibrated-variant-calling should be done in order to get trustworthy results? what the 'trustworthy results' means?
    when i run some round of BQSR and get a lot of bam file that used to run a real snp calling . At last get a vcf file and the questions is that may i use the vcf file as an input file when i run VQSR?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Hi @tylor11‌,

    Typically, people do 2 to 4 rounds of recalibration and calling after the initial unrecalibrated calling. The idea is that you stop when the recalibration plots look the same each time. Once that is done, you can proceed to recalibrate your variants. You can use a high-confidence subset of those variants as truth/training resource for VQSR, but it's up to you to decide what level of confidence to use as a prior.

    Geraldine Van der Auwera, PhD

  • tylor11tylor11 chinaPosts: 3Member

    OK got it, thanks very much!

  • shangzhong0619shangzhong0619 La JollaPosts: 7Member
    edited September 2014

    @Geraldine_VdAuwera said:
    Number of rounds will depend on your dataset. More is better but at some point you begin to see convergence, and improvements are incrementally lower.

    Base recalibration performs an important function for the quality of your callset. Doing the iterative process described above is a lot of work and is not 'perfect', but it is better than skipping recalibration altogether.

    >

    Hello, Geraldine,
    I have a question. I know in the first round, since we don't have dbsnp for non-human organism, we will use variant filter to manually filter the raw snps and indels, but in the second round, we would have "gold standard" snps and indels got from the first round. So in this round should we filter the raw snps and indels by using VQSR with the "gold standard" from round 1 as input, or still filter manually?

    Post edited by shangzhong0619 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Hi @shangzhong0619,

    To clarify, we don't consider that the output of the bootstrapping rounds deserves to be called a "gold standard" because that would imply it is a very high quality callset. That is not quite true -- really it is just acceptable quality for BQSR.

    It is possible to use the same callset as a truth set for VQSR, but you need to have an estimate of how confident you are that all the sites in it are good, to calibrate the model correctly. This can be difficult to do and takes a bit of experimenting. All this to say, you can try, but the results may not be better than with manual filtering.

    Geraldine Van der Auwera, PhD

  • michelmichel Posts: 13Member

    Hi, i read the entry about how to do BQSR without a knownSNP file and still have some uncertainties..

    I am actually calling SNPs from RNA-seq data on a draft genome of a non-model organism and were wondering what best practice might be (must sound like a nightmare to you working with human :smile: ).

    I can think of following workflow:

    1. best practice SNPcalling for RNA reads with HaplotypeCaller
    2. filter variants for "high quality" (-window 25 -cluster 3 --filterExpression "MQ < 30.0" --filterExpression "QD < 2.0"  --filterExpression "DP < 5" 
    3. select for PASS SNPs and biallelic SNPs
    4. use the SNPs as knownSNPs to do BQSR
    

    Should i include heterozygous SNPs to generate the BQSR-recalibration file?
    Would you agree on that workflow or alter the filters ( i know filtering for depth is not a good thing to do but for RNA-seq i think its good to have some minimal coverage of a site)

    Comments and recommendations are very welcome,
    Thank you,

    Michel

  • SheilaSheila Broad InstitutePosts: 3,748Member, Broadie, Moderator, Dev admin

    @michel
    Hi Michel,

    I think I answered your question here: http://gatkforums.broadinstitute.org/discussion/5752/calling-snps-from-rnaseq-data-without-reference-snpset#latest Don't worry about posting twice to get our attention! We keep track of everything :smiley:

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    By which we mean: don't post twice to get our attention. It causes duplicated effort and wastes time.

    Geraldine Van der Auwera, PhD

  • jmm1jmm1 New Haven, CTPosts: 9Member

    Hello,

    I have a question about best practices for running BQSR on a smaller amplicon data set in a non-model organism. My experimental set up is that I have 30 regions amplified in 128 individuals that were sequenced in pools of ~16 individuals over 9 different Ion Torrent chips. The individuals were demultiplexed into per-sample fastq files before I began with GATK.

    Given that it is a non-model organism I have been following the best practices suggestions for bootstrapping the dataset to get a list of candidate loci for BQSR. Specifically I quality trimmed reads, aligned to a reference, recalled indels, and created gVCFs on an individual basis. I then merge the gVCFs within chips and called a set of SNPs that would be used in BQSR.

    At this point I am wondering do I implement BSQR on each sample individually or on all of the data from one chip pooled? My reading of the documentation is that BSQR is applied on a per-sample basis, but I worry that I would loose power with a smaller dataset like mine.

    Thank you.

  • SheilaSheila Broad InstitutePosts: 3,748Member, Broadie, Moderator, Dev admin

    @jmm1
    Hi,

    Usually we only recommend to run BQSR if you have more than 100 million bases in your dataset. How many do you have in yours? This thread may also help you: http://gatkforums.broadinstitute.org/discussion/4272/targeted-sequencing-appropriate-to-use-baserecalibrator-bqsr-on-150m-bases-over-small-intervals

    -Sheila

  • jmm1jmm1 New Haven, CTPosts: 9Member

    Hi Sheila,

    Thank you for the response. The total length of the targeted regions is ~25,000bp and number of bases I have per chip ranges from ~57M to ~187M. So given that it is amplicon sequencing (so there are PCR duplicates) I take it you would not recommend trying to run BQSR?

  • SheilaSheila Broad InstitutePosts: 3,748Member, Broadie, Moderator, Dev admin

    @jmm1
    Hi,

    I don't think BQSR will have enough data to make any significant changes. You can always run with and without BQSR if you have time and compare the results.

    -Sheila

  • agiletaagileta Chicago, ILPosts: 2Member

    Hi,

    Along similar lines of prepagam's question from two years back...our lab is working with outbred rats currently. The resources aren't at the same level as human or mouse just yet. The best available reference SNP data comes from 42 inbred lines, 5 of which were derived from the same strain of rat we are studying. However, the inbred lines were established tens of generations ago (if not more), in which time novel variation may have arisen. We have no estimation of how diverged our population is from those in the reference set.

    I am worried about using the large 42 genome set of variants, as I feel that it is going to be too "lenient" in a way, since BQSR will overlook a large amount of sites, even those that may not be true variants in our population. However, if I limit myself only to the 5 strains derived from our strain, there may be a lot of sites BQSR sees as errors, but are true variant sites.

    As I see it, my options are:
    1) Use the full set of 42 genome reference SNPs
    2) Use the reduced set of 5 strains specifically derived from ours, albeit a long time ago.
    3) We have WGS on 80 rats from our population at ~5x coverage and could call variants from this, and use the high quality ones moving forward.
    4) Utilize the bootstrapping method described in this thread on our data.

    I am not sure what the most appropriate approach would be in this case and would appreciate any input you could provide. Please let me know if I can provide any other information that might be helpful in making such a decision. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Since the purpose of the known sites in BQSR is to mask out real variation, and it is better to over-mask than to under-mask, I would recommend using all SNPs previously generated, and add to that a bootstrapped set from your animals.

    Geraldine Van der Auwera, PhD

  • agiletaagileta Chicago, ILPosts: 2Member

    Great, thank you so much for the advice Geraldine! Given that there is already an extensive database of variants, would I include these when generating the initial bootstrapped set so that I mask any variants that would overlap between the sets? And would I still need to perform multiple iterations of the bootstrapping as described above, even though I am not using it as my sole SNP set? Or could I just call once, take the best SNPs, and use them right away as the second set?

    One other question that I neglected to include in the previous post was concerning "high quality" variants. Do you have a recommendation about thresholds for creating this set? Perhaps a certain combination of variant quality scores, MAF, average depth over coverage, call rate, etc.? I'm just not sure how stringently the bootstrapped set should filtered.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    @agileta Yes, you can use the existing db of variants in the first bootstrapping run. Since in your case, unlike many users who need to do this, you already have a db of variants to draw on, you may not need to do multiple iterations -- but consider trying one or two more anyway to see if there is a noticeable effect.

    For the thresholds, you can start from our recommendations for hard-filtering variants (which are explained here). You may need to experiment a little and tweak the settings because results will vary depending on your particular dataset. One way to approach this is to plot the distributions of the annotations in your data to estimate what thresholds might be appropriate.

    Geraldine Van der Auwera, PhD

  • y4dary4dar TexasPosts: 5Member

    Thanks for all of the information on this thread. The detail provided is VERY helpful.

    I'm working with eight non-model taxa and have mapped reads to a reference genome. Now I am attempting to call variants vs. that reference.

    I've made it through a workflow to the point where I need to recalibrate (used IndelRealigner).

    I've read various threads on the topic and I think I'm ready to proceed with the process of calling, calibrating, and calling again, to generate a set of high confidence SNPs for a -knownSites .vcf.

    That being said, I'd like to verify my plan before moving on. I have no desire to go through what is likely to be a long process without some confidence that what I'm doing is solid.

    Here is what I plan to do. I would appreciate it if you would let me know if my plan stinks.

    1. Skip the initial recalibration step and move on to calling the SNPs with HaplotypeCaller. This should yield a .vcf with unrecalibrated SNP calls which I will call "species_raw_variants_norecal.vcf".
    2. Filter that file for high confidence SNPs (which I still need to figure out how to do) to yield a filtered set of SNPs --> "species_filtered1_variants_norecal.vcf".
    3. Now recalibrate with BaseRecalibrator using -knownSites "species_filtered1_variants_norecal.vcf".
    4. View the results.

    Here's where I run into trouble. The suggestion is to "Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

    I should now have a .bam of recalibrated reads that have been informed by the high confidence SNPs. Would that then be reused to call again (step 1 below)?

    In other words, here is what I plan to do:

    1. java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I species_realigned_reads.bam (from IndelRealigner) --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY -stand_emit_conf 10 -stand_call_conf 30 -o species_raw_variants_norecal.vcf

    2. Filter (again, I'm not sure how to filter at this point but I'll keep reading. Any suggestions?) to give me "species_filtered1_variants_norecal.vcf".

    3. java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fa -I species_realigned_reads.bam -knownSites species_filtered1_variants_norecal.vcf -o species_recal_data1.grp

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fa -I species_realigned_reads.bam -knownSites species_filtered1_variants_norecal.vcf -BQSR species_recal_data1.grp -o species_post_recal_data1.grp

    java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fa -I species_realigned_reads.bam -BQSR species_recal_data1.grp -o species_recal_reads1.bam

    1. View the results

    Repeat using species_recal_reads1.bam as my input?

    Thanks. I know this is a lot. I am learning as quickly as I can but need some help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    I assume that by "recalibrate (used IndelRealigner)" you mean using useing BaseRecalibrator ;)

    For your first round: you're correct except you'll get SNPs and indels out of HaplotypeCaller. For filtering, look up our hard-filtering recommendations.

    You've got all of it right except the repeat step is done on the original bam, not the recalibrated bam, using variants called in a second round of HaplotypeCaller+filtering. The idea is to refine the set of known sites. It may not take more than 2 to 3 repeat cycles, tops.

    Geraldine Van der Auwera, PhD

  • ianwianw Bangor, UKPosts: 38Member
    edited June 21

    Hi @Geraldine_VdAuwera and @Sheila

    Sorry to resurrect this thread but I wasn't sure my question warranted a new one! I'm looking for some clarification regarding BQSR bootstrapping for non-model organisms as existing documentation and Q&As have confused me a bit.

    I've sequenced 223 genomes (so far), some of which have only 1 Read Group, some of which have more, depending on when and to what depth the sequencing was performed. The majority of the genomes were sequenced to a depth of 5-7x, though some were covered to 15x. The reference genome has a total length of ~860Mb.

    After assembling and realigning each genome, I called variants using HaplotypeCaller and GenotypeGVCFs (using intervals, so as to reduce computation time). I then ran BaseRecalibrator twice and AnalyzeCovariates to generate the before and after plots. Finally, I ran PrintReads to generate recalibrated bam files for each genome.

    My questions are:

    1) Do I now use these recalibrated bam files as input files for a second round of HaplotypeCaller (and GenotypeGVCFs)?

    2) When applying hard filters in the 2nd round, I assume I should use the same filters as in Round 1 (hence, bootstrapping, right?)

    3) I've read that the 2nd round of BaseRecalibrator should be applied to the original un-recalibrated bam files. Is this correct?

    4) When attempting to re-run BaseRecalibrator with the un-recalibrated bam files and 2nd round of HaplotypeCaller-generated variants, my AnalyzeCovariates outputs were identical to the first round's output. In each set, the before and after points on each graph are very different, but there is no difference between the first and second round's output. Does this mean BQSR has worked or that I've done something wrong? I suspect the latter...

    The key problem is knowing when and where to use the original unrecalibrated bam and when/where to use each round's recalibrated bam (I know, I'm a bear of very little brain :wink: )

    Many thanks for your time and patience. Any and all feedback and advice will be gratefully received!

    Ian

    Post edited by ianw on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Hi Ian @ianw,

    To your questions 1 through 3, it's all yes/correct. To the fourth, it's quite possible for the first round to work well enough that you don't see a difference in the second round. I'm happy to take a look at some representative recal plots if you'd like.

    Geraldine Van der Auwera, PhD

  • ianwianw Bangor, UKPosts: 38Member

    @Geraldine_VdAuwera

    Hi Geraldine,

    Thanks for the quick response. I'll need to regenerate a couple of the 2nd round plots but I'll get them to you ASAP, thank you very much for the offer.

    Ian

  • ianwianw Bangor, UKPosts: 38Member

    @Geraldine_VdAuwera

    Good morning Geraldine,

    I've attached 4 PDFs. They were produced using Analyze Covariates for 2 samples (G01A06 and G02C07) after Round 1 (R1) and Round 2 (R2) of Base Recalibration. There are very slight differences between rounds and I'd be interested to see what you think.

    I possibly should say that I chose my 'truth sets' of SNPs and indels using the following parameters. The same parameters were used for both variant types (with the exception of the mask, which was obviously only used for the SNPs) as, after plotting them out, the same cutoffs were applicable in both instances:

     --filterExpression 'FS > 10.0' --filterName MappingQuality --filterExpression 'MQ < 55.0' --filterName QualByDepth --filterExpression 'QD < 5.0' --filterName MQRankSum --filterExpression 'MQRankSum < 2.5 || MQRankSum >-2.5' --filterName ReadPosRankSum --filterExpression 'ReadPosRankSum > 2.5 || ReadPosRankSum < -2.5' --mask ${i}_raw_indels.vcf.gz --maskName InDel
    

    If I may ask one more question, would I be best running through HC and GGVCFs etc again and then filtering using tweaked hard filters or using my truth set in VQSR, but with, perhaps, slightly more leniency than one would with, say, a human variant truth set? I'll be using the output in a GWAS.

    Many thanks,

    Ian

    PDF
    PDF
    G01A06_R1.PDF
    247K
    PDF
    PDF
    G02C07_R1.PDF
    248K
    PDF
    PDF
    G01A06_R2.PDF
    248K
    PDF
    PDF
    G02C07_R2.PDF
    249K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    The plots look very reasonable to me. You do seem to lose a bit in overall base quality; I wonder if it's because your original base quals were binned. But the difference between rounds don't seem significant to me. Your filter settings may be a bit more stringent than necessary; if you have time to experiment a little more you could relax them somewhat and see if it makes a substantial difference (in terms of number of calls made and of resulting distribution of base quals).

    VQSR is probably going to be another can of worms' depending on the known sites resources you have (or can generate from this data) because it needs a lot of sites to build a robust model. As you have fairly low coverage (compared to what we normally work with) the discriminating power of each annotation may be weaker than you'd like, so you may need more sites than usual to get a solid model. You can definitely experiment with it but I can't guarantee that it will perform well, especially on a testing subset. You're only likely to get clear answers stage once you're running it on your full dataset (of all your samples).

    Geraldine Van der Auwera, PhD

  • ianwianw Bangor, UKPosts: 38Member

    That's good to know, although how can you see a difference in base quality? With the exception of a few points, the plots all look identical to my (untrained) eye! Also, you mention the base quals being binned - this is a default setting when dealing with Illumina data, right? So is it basically the case that resolution of scores is lower so any decreases can appear larger than they actually are because they drop into lower value bins? I assume this isn't actually cause for concern though?

    Final question from me - I got 2,964,578 SNPs that passed my original variant calling step. After BQSR and a 2nd round of variant calling, I got 3,340,666 SNPs. No doubt I'm just misunderstanding but I'd expected this number to decrease after BQSR if the same filters are being used both times. Could you perhaps point me to an explanation of why this happens, please?

    I may have time to play with the filters for BQSR, although this would mean going back to the start, I assume, so I'll be limited to one or two attempts. And I thought that might be the case regarding VQSR. I suspect I'll play around with hard filters then, at least for now!

    Issue · Github
    by Sheila

    Issue Number
    1025
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,388Administrator, Dev admin

    Hi @ianw, if you look at the third page of the pdfs you'll see histograms that show the distribution of base quality scores in your data before (blue) and after (pink) recalibration. There does not seem to be much difference between bootstrapping rounds, but there's definitely a big difference between before and after.

    The binning is indeed done by default by Illumina; that in itself is not a cause of concern, but it seems to me that the shift in quals goes beyond just increased resolution. This is not necessarily a bad thing (sometimes the quality of calls overall is truly lower than what is reported by the machine) but it could also indicate that not enough real sites are being masked by the bootstrap set (which is why I was suggesting to relax filtering parameters). It's better to over-mask a little than under-mask.

    Regarding the number of variants called after BQSR, we don't actually have an expectation of whether there should be more or fewer. It's possible that a subpopulation may have been missed or called with lower confidence prior to recalibration, since recalibration can go in both directions (even if overall your qualities are lower, a subset of position may benefit from increased scores). One thing you can do if you're concerned is examine the intersection between the two callsets, as well as the private calls in either, and look at the annotations and properties of the sequence at those sites.

    Geraldine Van der Auwera, PhD

  • ianwianw Bangor, UKPosts: 38Member

    Hi @Geraldine_VdAuwera

    Thank you, that all makes sense :smiley: I'd assumed that it was a case of the stricter the better for a GWAS - good to know; I'll try loosening the parameters and comparing the results.

    Thanks again - as ever, your answers have been very helpful!

Sign In or Register to comment.