Best recommendation for base recalibration on non-human data

sagipolani Posts: 41 Member
edited October 2012 in Ask the GATK team

Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping with BWA, sorting, and merging, my pipeline goes like this:

  1. Remove duplicates via MarkDuplicates.
  2. RealignerTargetCreator
  3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
  4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input VCF file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!

Sagi


Best Answers

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer
    edited October 2012 Answer ✓

    Hi Sagi,

    This is covered in the BaseRecalibrator FAQ, here:

    http://www.broadinstitute.org/gatk/guide/article?id=44

    The relevant FAQ entry asks: "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

    The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately, without this information the data becomes almost completely unusable, because the quality of the bases will be inferred to be much lower than it actually is as a result of the reference-mismatching SNP sites.

    However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works: first, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs, feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence.
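
    As an illustration, a single bootstrap round could look something like this, using GATK 3.x-era command lines. The file names and the filter expression here are placeholders to adapt to your own data, not recommendations:

        # 1. Initial calling on the unrecalibrated BAM.
        java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
            -R ref.fasta -I realigned.bam -o raw_round1.vcf

        # 2. Keep only the calls you trust most, to serve as the "known sites" set.
        java -jar GenomeAnalysisTK.jar -T SelectVariants \
            -R ref.fasta -V raw_round1.vcf \
            -select "QD > 10.0 && QUAL > 50.0" -o known_round1.vcf

        # 3. Build the recalibration model, masking the bootstrapped sites.
        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
            -R ref.fasta -I realigned.bam \
            -knownSites known_round1.vcf -o recal_round1.grp

        # 4. Write the recalibrated BAM and do the "real" round of calling.
        java -jar GenomeAnalysisTK.jar -T PrintReads \
            -R ref.fasta -I realigned.bam -BQSR recal_round1.grp -o recal_round1.bam
        java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
            -R ref.fasta -I recal_round1.bam -o calls_round1.vcf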


    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer
    Answer ✓

    Number of rounds will depend on your dataset. More is better, but at some point you begin to see convergence and the improvements become incrementally smaller.

    Base recalibration performs an important function for the quality of your callset. Doing the iterative process described above is a lot of work and is not 'perfect', but it is better than skipping recalibration altogether.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer
    Answer ✓

    Hi Sagi,

    Base recalibration and variant recalibration are completely different processes that deal with completely different problems, at different steps of the variant calling workflow (hint: one is done on bases, the other is done on variants...).

    This is covered in detail in our documentation; a good place to start reading is the Best Practices article:

    http://www.broadinstitute.org/gatk/guide/article?id=1186

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer
    Answer ✓

    I would recommend manually creating a filtered subset of high-quality calls rather than relying on the LowQual annotation from the genotyper. See the variant filtering and selection tools for details on how to do this.
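
    For example, a hard-filtering pass could look like this (the thresholds and annotations are hypothetical; tune them to your own callset):

        # Flag records that fail any of the hard filters...
        java -jar GenomeAnalysisTK.jar -T VariantFiltration \
            -R ref.fasta -V raw.vcf \
            --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
            --filterName "lowConfidence" -o flagged.vcf

        # ...then keep only the records that passed.
        java -jar GenomeAnalysisTK.jar -T SelectVariants \
            -R ref.fasta -V flagged.vcf --excludeFiltered -o high_confidence.vcf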

    Geraldine Van der Auwera, PhD

Answers

  • sagipolani Posts: 41 Member

    Hi there!

    Thank you very much for this answer.

    In light of this, the question remains: how many rounds of unrecalibrated variant calling and recalibrated variant calling should be done in order to get trustworthy results?

    Or is it best to skip the recalibration process altogether when no trustworthy database is available?

    Another question relates to comparisons between individuals. For example, if I have a VCF file of variants called in one individual and use it as the input VCF for recalibration, am I limiting myself to the variants present in that input file? If so, could I be giving up on identifying novel variants in the new individual?

    Thanks!

    Sagi

  • sagipolani Posts: 41 Member

    OK, got it: call variants on the non-recalibrated BAM file, then use the output VCF as input for BaseRecalibrator. Great! Now I'm wondering what the difference is between BaseRecalibrator and VariantRecalibrator, and which is better to use for non-human genomes?

    Thanks!

    Sagi

  • sagipolani Posts: 41 Member

    Another question please: following your recommendations, when using the UnifiedGenotyper to call variants prior to recalibration, the output VCF marks the low-quality calls as "LowQual". In order to use only the high-quality calls as input for BaseRecalibrator, do I need to remove these "LowQual" calls manually first, or will BaseRecalibrator identify and ignore them automatically?

    Thank you very much!

    Sagi

  • sagipolani Posts: 41 Member

    Hi Geraldine,

    Thanks again for your help - will do!

    I would also kindly appreciate your input on my earlier question: when using the UnifiedGenotyper to call variants prior to recalibration, the output VCF marks the low-quality calls as "LowQual". In order to use only the high-quality calls as input for BaseRecalibrator, do I need to remove these "LowQual" calls manually first, or will BaseRecalibrator identify and ignore them automatically?

    Thanks again!

    Sagi

  • kanghm Posts: 18 Member

    Hi all! I'm working on non-human data. I'm wondering what you mean by "converge", and how I can tell when my results have converged. Thank you! kanghm

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Convergence is when successive runs of the recalibration process give you roughly the same results. In other words, in the first few runs you should see big differences each time, and then the differences will get smaller and smaller.

    Geraldine Van der Auwera, PhD

  • kanghm Posts: 18 Member

    OK, thanks very much. Does "the same results" mean the same reported quality scores after recalibration? Do I need to write my own script to confirm convergence, or can I get this information elsewhere? And another question: each time I run base quality recalibration (in the loop), should I use the unrecalibrated BAM file rather than the BAM file from the previous recalibration round?

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    If you generate the before/after plots described in the documentation, you should be able to use them to assess convergence. And that's right: each round should start from the unrecalibrated BAM file.
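
    For reference, in GATK 2.x/3.x-era versions those plots come from a second BaseRecalibrator pass followed by AnalyzeCovariates, roughly like this (file names are placeholders):

        # Second pass: model the residual error left after applying the first recalibration.
        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
            -R ref.fasta -I input.bam -knownSites known.vcf \
            -BQSR recal_before.grp -o recal_after.grp

        # Plot the two models against each other.
        java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
            -R ref.fasta -before recal_before.grp -after recal_after.grp \
            -plots recal_plots.pdf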

    Geraldine Van der Auwera, PhD

  • kanghm Posts: 18 Member

    By "before/after plots" you mean the "PDF file containing quality control plots showing the patterns of recalibration of the data", right? I'm currently having problems generating that file. Can I assess convergence from the recalibration report instead? If so, could you specify where to look? Thank you very much!

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    You should really figure out how to get those plots generated, because they are very important for quality control of the recalibration process. Without them you are working blindly. What is the problem you're having with them?

    Otherwise I guess you could compare the successive recalibration reports, sure. You would need to load the grp files into R using gsalib, and write some kind of script that plots a comparison. But it will be a little complicated; you'll need to be comfortable working with R.
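
    A rough sketch of that comparison might look like the following (the table and column names follow GATK 2.x-era recalibration reports and may differ in your version):

        # Compare two successive recalibration reports with gsalib, from the shell.
        Rscript -e '
        library(gsalib)
        r1 <- gsa.read.gatkreport("report_round1.grp")  # earlier round
        r2 <- gsa.read.gatkreport("report_round2.grp")  # later round
        # RecalTable1 holds one row per read group / reported quality / event type.
        m <- merge(r1$RecalTable1, r2$RecalTable1,
                   by = c("ReadGroup", "QualityScore", "EventType"))
        # Near-zero differences between rounds suggest convergence.
        print(summary(m$EmpiricalQuality.x - m$EmpiricalQuality.y))'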

    Geraldine Van der Auwera, PhD

  • kanghm Posts: 18 Member

    Now I'm trying to install gatk-protected-master.zip, which seems to be required for plotting (am I right?). When I run "ant gsalib", it outputs: Buildfile: /home/kanghm/gatk-protected-master/build.xml [taskdef] Could not load definitions from resource cloverlib.xml. It could not be found. If this won't affect the next steps, I'll proceed (I also need to upgrade my R to install the gplots package). I'm also referring to http://gatkforums.broadinstitute.org/discussion/1297/no-plots-generated-by-the-baserecalibrator-walker to get my problem solved. I hope it will help. I'm so grateful for your patient help. Thank you again, you are so kind!

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    It is much better to make the plots, yes.

    Someone else had the same error recently and we resolved it, have a look at this discussion: http://gatkforums.broadinstitute.org/discussion/2467/install-gsalib

    If you're going to upgrade R, make sure to do it before installing the libraries, just in case the upgrade doesn't preserve them.

    Geraldine Van der Auwera, PhD

  • kanghm Posts: 18 Member
    edited April 2013

    Hello, @Geraldine_VdAuwera: I'm back again! With your help, I have successfully installed gsalib, and it works now!

    However, I've come to another question: why do you suggest recalibrating with only the highest-confidence SNPs (for non-human data)? I'm analyzing pig, which has a relatively small SNP database on NCBI (about 400 thousand records). I recalibrated with this database, but when I called SNPs from the recalibrated BAM file, I got low power to detect variants; calling from the unrecalibrated BAM file gave high power. So I suspect I should not recalibrate with this database, because it is too small. Am I right? And since you recommend recalibrating with the highest-confidence SNPs, doesn't that also mean a small SNP database, with results equivalent to the NCBI database? Or should I use a larger database (of lower quality)? Looking forward to your answer!

    PS: One reason I think the NCBI database is small: the number of SNPs I called is several million (high quality, passing my filter criteria), even with the recalibrated BAM file (using the NCBI database).

    And the reason I want to do recalibration is that the unrecalibrated results have a relatively high false positive rate.

    Thank you very much! kanghm

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi @kanghm,

    This is a difficult question because there is no fixed rule for the size of your database relative to the number of variant calls you get with your samples. It depends on a lot of things, including how much variation there is between individuals in your organism of study.

    You will have to decide for yourself based on the principle of how base recalibration works, which is this:

    When BaseRecalibrator builds the recalibration model, it assumes that every mismatch is a sequencing error EXCEPT if that position is in the database of known variants. So, the database works like a mask that "protects" SNPs from being interpreted as errors.

    The trade-off is this: if your database is very large and has a lot of junk calls (SNPs that are not real), you will over-mask and the BaseRecalibrator will underestimate the amount of errors in your data. Down the road that usually leads to more false positives because the scores of many mismatching base calls are higher than they should be. Conversely, if your database is very small (and presumably contains only very high-confidence calls), you will under-mask, and BaseRecalibrator will overestimate the amount of errors in your data. Later that usually leads to more false negatives because the scores of base calls at real SNPs are lower than they should be.

    Does that help clarify things?

    Geraldine Van der Auwera, PhD

  • ymw Posts: 9 Member

    I have a question regarding this thread of discussion, namely choosing highly confident SNPs for recalibration. It was recommended to manually filter the SNPs from the VCF produced by UnifiedGenotyper. Can we simply raise the "-stand_call_conf" and "-stand_emit_conf" options in UnifiedGenotyper to produce highly confident SNPs? That would save the time of extra filtering. Would 50 be high enough for "-stand_call_conf" and "-stand_emit_conf" to produce good SNP calls?

    Thanks

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi @ymw, we prefer to run UG with low thresholds to produce a highly sensitive, inclusive callset, and then filter for specificity. Part of the reason is that while call confidence is a useful metric, sometimes you'll need to look at other annotated metrics to distinguish real variants from artifacts. And if you want to change your filtering / confidence parameters, it's faster to run the filtering again than to redo the calling itself.
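
    As a hypothetical example (the thresholds here are illustrative only):

        # Call permissively to stay sensitive...
        java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
            -R ref.fasta -I recal.bam \
            -stand_call_conf 30.0 -stand_emit_conf 10.0 -o sensitive.vcf

        # ...then filter for specificity; refiltering is much cheaper than recalling.
        java -jar GenomeAnalysisTK.jar -T VariantFiltration \
            -R ref.fasta -V sensitive.vcf \
            --filterExpression "QUAL < 50.0 || QD < 2.0" \
            --filterName "lowConfidence" -o filtered.vcf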

    Geraldine Van der Auwera, PhD

  • prepagam Posts: 19 Member

    I'm working on a non-human taxon, with the aim of looking at SNP variants for downstream analyses using statistics based on the site frequency spectrum. I'm in a situation where I have a database of some known SNPs from population A of my species, but I'm trying to call SNPs in both population A and population B. My concern is that using the database of known SNPs will create a bias (more detection/sensitivity for population A than B), which will skew the site frequency spectrum. This question applies to the use of databases for both base quality score recalibration and variant quality score recalibration.

    I wondered what your thoughts are. If the human databases had been derived only from Europeans, for example, would you be hesitant to use them in a pipeline to call SNPs in a group of African samples, if you then wanted to compare variant statistics between Europeans and Africans? (Though I'll add that my species shows more population structure than humans!) Obviously the alternative is to pretend I don't have a database and do as you describe above (a first round of SNP calling, then use the best of those calls as a database, and so on). I'm unsure whether that is better than using the database.

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi @prepagam,

    The simplistic answer is I wouldn't worry about it. The more nuanced answer is that it depends on how distinct your populations are from each other. In humans, we wouldn't worry about using known sites from e.g. European samples to recalibrate African samples, because there's not enough difference to impact the recalibration process. For VQSR especially, only the sites overlapping both your known set and your test sets will be used, so really you'll be using only common sites. For BQSR it is theoretically possible that you might introduce some bias, but it would only matter if the sequencing error modes specifically affected one population's variants, which is extremely unlikely.

    Of course if you want to be super-safe, you could bootstrap your own DB of variants and use it alone, then in combination with your existing DB of sites, and compare results.

    Geraldine Van der Auwera, PhD

  • prepagam Posts: 19 Member

    As per the recommendations for non-human data, I've been running BQSR for multiple rounds to achieve convergence (which I define as the point where base qualities no longer change and look recalibrated, i.e. fall on the diagonal line). I am doing this BQSR on each sample separately, and notice that different samples require different numbers of rounds to converge (between 2 and 4), which is consistent with the fact that my data comes from many sources. In order to automate my pipeline, I'd like to run a fixed number of rounds of BQSR on all samples, e.g. 4, which is always enough. Might this be problematic for the samples that really only needed 2 rounds to converge? Or do I need to check each one manually after each round, and do additional rounds as needed?

  • Sheila (Broad Institute) Posts: 540 Member, GATK Developer, Broadie, Moderator

    @prepagam

    Hello,

    This is not a problem. It is absolutely safe to do, as long as you don't mind spending the extra compute up front (as opposed to only doing >2 rounds when it looks necessary).
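
    For what it's worth, a fixed-round per-sample loop might be sketched like this (the round count, file names, and filter expression are placeholders to adapt):

        # Fixed number of bootstrap rounds for one sample. Each recalibration
        # restarts from the ORIGINAL unrecalibrated BAM, while each round of
        # calling uses the most recently recalibrated BAM.
        CURRENT=original.bam
        for i in 1 2 3 4; do
            java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
                -R ref.fasta -I "$CURRENT" -o "calls_${i}.vcf"
            java -jar GenomeAnalysisTK.jar -T SelectVariants \
                -R ref.fasta -V "calls_${i}.vcf" \
                -select "QUAL > 50.0" -o "known_${i}.vcf"
            java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
                -R ref.fasta -I original.bam \
                -knownSites "known_${i}.vcf" -o "recal_${i}.grp"
            java -jar GenomeAnalysisTK.jar -T PrintReads \
                -R ref.fasta -I original.bam -BQSR "recal_${i}.grp" -o "recal_${i}.bam"
            CURRENT="recal_${i}.bam"
        done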

    -Sheila

  • tylor11 (China) Posts: 3 Member

    @Geraldine_VdAuwera said: I would recommend manually creating a filtered subset of high-quality calls rather than relying on the LowQual annotation from the genotyper. See the variant filtering and selection tools for details on how to do this.

    I have followed your advice on known sites, but some questions confuse me. After several rounds of unrecalibrated variant calling and BQSR, which result is the right one to keep? And can I use that result as a resource when I run VQSR?

    -taylor

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    @tylor11 Your questions confuse me... Can you please clarify what you are asking?

    Geraldine Van der Auwera, PhD

  • tylor11 (China) Posts: 3 Member

    @Geraldine_VdAuwera:

    My question is: how many rounds of unrecalibrated and recalibrated variant calling should be done in order to get trustworthy results, and what do "trustworthy results" mean? When I run several rounds of BQSR, I get a number of BAM files, which I then use for a real round of SNP calling. At the end I get a VCF file, and the question is whether I may use that VCF file as an input file when I run VQSR.

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi @tylor11‌,

    Typically, people do 2 to 4 rounds of recalibration and calling after the initial unrecalibrated calling. The idea is that you stop when the recalibration plots look the same each time. Once that is done, you can proceed to recalibrate your variants. You can use a high-confidence subset of those variants as truth/training resource for VQSR, but it's up to you to decide what level of confidence to use as a prior.

    Geraldine Van der Auwera, PhD

  • shangzhong0619 (La Jolla) Posts: 3 Member
    edited September 5

    @Geraldine_VdAuwera said: Number of rounds will depend on your dataset. More is better, but at some point you begin to see convergence and the improvements become incrementally smaller.

    Base recalibration performs an important function for the quality of your callset. Doing the iterative process described above is a lot of work and is not 'perfect', but it is better than skipping recalibration altogether.

    Hello Geraldine, I have a question. I know that in the first round, since we don't have dbSNP for a non-human organism, we manually filter the raw SNPs and indels with variant filtering; but in the second round, we have a "gold standard" set of SNPs and indels obtained from the first round. So in this round, should we filter the raw SNPs and indels using VQSR with the "gold standard" from round 1 as input, or still filter manually?

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi @shangzhong0619,

    To clarify, we don't consider that the output of the bootstrapping rounds deserves to be called a "gold standard" because that would imply it is a very high quality callset. That is not quite true -- really it is just acceptable quality for BQSR.

    It is possible to use the same callset as a truth set for VQSR, but you need to have an estimate of how confident you are that all the sites in it are good, to calibrate the model correctly. This can be difficult to do and takes a bit of experimenting. All this to say, you can try, but the results may not be better than with manual filtering.
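
    If you do experiment with it, the bootstrapped callset would go in as a VQSR resource with an explicitly chosen prior, along these lines (GATK 3.x-era syntax; the prior and the annotation list are placeholders you would have to calibrate for your own data):

        java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
            -R ref.fasta -input raw.vcf \
            -resource:bootstrap,known=true,training=true,truth=true,prior=10.0 bootstrap_calls.vcf \
            -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum \
            -mode SNP -recalFile output.recal -tranchesFile output.tranches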

    Geraldine Van der Auwera, PhD
