
BQSR errors vs real variation

AlexanderV Berlin, Member
edited August 2015 in Ask the GATK team

Suppose I have a known rate of 1 SNP every 30 base pairs, and my reads (counting only the 'good' ones used in BQSR) contain 1e10 base pairs in total. That gives 1e10 / 30 ≈ 3.3e8 SNPs on my reads.
Can I say that I expect BQSR to find 3.3e8 + epsilon errors when not masking any known SNPs?

This is connected to creating my own set of known SNPs for BQSR. So far I have the feeling from the reports that the error rate is too high and that my set should be bigger. I am using the known SNP rate in my data as a comparison to the Observations/Errors in the recalibration report and table. With the optimal set, I would get 3.3e8 + epsilon Observations and epsilon Errors.
Am I right?
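As a sketch (not from the thread itself), the back-of-the-envelope expectation in the question works out like this, using the 1-in-30 SNP rate and the 1e10 bp total quoted above:

```python
# Expected number of SNP-carrying positions in the reads,
# given 1 SNP per 30 bp and 1e10 'good' base pairs used by BQSR.
total_bases = 1e10   # base pairs in the 'good' reads
snp_rate = 1 / 30    # known rate: 1 SNP every 30 bp

expected_snps = total_bases * snp_rate
print(f"{expected_snps:.2e}")  # -> 3.33e+08, the 3.3e8 figure in the question
```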



  • Sheila Broad Institute, Member, Broadie admin


    I think what you are saying is correct, but can you post an example of the error output? It is true that BQSR considers all mismatches to the reference as errors, and the sites you provide as known sites are masked and not counted as errors.


  • AlexanderV Berlin, Member

    Hi @Sheila,

    I missed your answer. Now I'm coming back to it.

    I get the following Obs and Err values

    Obs             Err             Set
    10,160,023,994  281,306,385.20  minimal
     8,450,302,946  131,198,570.97  own, from raw calls

    The first is BQSR with a minimal set (referring to this [1] post and your answer).
    The second comes from raw calling and filtering.

    I'm still twisting my head around this:

    We assume that all reference mismatches we see are therefore errors [...]
    [...] empirical probability of error [...], where p(error) = num mismatches / num observations.

    The probability of an error decreases when num mismatches decreases and/or num observations increases.
    In my case nothing is increasing, yet the error (probability) is decreasing. Therefore, the Obs column in the table must correspond to num mismatches, not num observations.
    If that is true, you should change the documentation ;)

    The first set contains 70,326 variants (SNPs only).
    The second set contains 1,159,988 variants (an overlap of 69,986 with the first set).
    Something weird: the overlap should be 70,326, because I combined in the first set (using CombineVariants on both sets with filteredrecordsmergetype=KEEP_UNCONDITIONAL and minimalVCF=true). Where are the other 340 variants? Is there an explanation? I'm counting with zgrep -vc '^#'.

    [1] http://gatkforums.broadinstitute.org/discussion/1243/what-are-the-standard-resources-for-non-human-genomes#Comment_23336
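As an aside on the counting step: the zgrep -vc '^#' command above counts non-header lines in a compressed VCF. A hypothetical Python equivalent (a sketch, not from the thread) would be:

```python
import gzip

def count_vcf_records(path):
    """Count non-header lines in a gzip-compressed VCF.

    Equivalent to: zgrep -vc '^#' file.vcf.gz
    Note this counts lines, so a multi-allelic record on one
    line counts once, just as with zgrep.
    """
    with gzip.open(path, "rt") as fh:
        return sum(1 for line in fh if not line.startswith("#"))
```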

  • fleharty Member, Broadie, Dev ✭✭

    @AlexanderV I'm a bit surprised that you have more errors than observations. Would you be able to send us a copy of your BQSR table file?

    We do assume that reference mismatches are errors, and p(error) = num mismatches / num observations, so I think the documentation is correct in that part.

  • AlexanderV Berlin, Member

    @fleharty: There are not more errors than observations; you might have read it wrong because of the punctuation. The error values have two digits after the decimal point.
    Anyway, here are the recal files:
    first/second are the two runs of BaseRecalibrator.
    solcap is the minimal set; iter_1 is the set I created.

    What I mean about the documentation is that the formula has the variables "p(error)", "num mismatches" and "num observations". In the text you mention "reference mismatches we see", which seems to correspond to "num observations". The report only gives "Obs" and "Err", so it is not clear to me what "num mismatches" is here.

  • fleharty Member, Broadie, Dev ✭✭

    @AlexanderV Ah yes, thanks, you were right about the 2 digits after the decimal point.

    The number of mismatches is closely related to the errors field in the recal table files.

    If we consider line 124 from your file BQSR_plate3.noChim.plate3.iter_1.first.table, we see:
    LB_1 6 M 8.0000 2079576 345581.61

    There were 345,581.61 errors and 2,079,576 observations. In this case p(error) = 345581.61 / 2079576 = 0.1661789, and computing round(-log10(0.1661789) * 10) gives 8, which is the EmpiricalQuality.
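    As a sketch of the calculation just described (Python, with an assumed helper name; not GATK code), the same formula also applies to the two Obs/Err rows posted earlier in the thread:

```python
import math

def empirical_quality(observations, errors):
    """EmpiricalQuality as described above:
    p(error) = num mismatches / num observations, on the Phred scale."""
    p_error = errors / observations
    return round(-10 * math.log10(p_error))

# Line 124 of the recal table: 2,079,576 observations, 345,581.61 errors
print(empirical_quality(2079576, 345581.61))         # -> 8

# The two Obs/Err rows posted earlier in the thread
print(empirical_quality(10160023994, 281306385.20))  # minimal set -> 16
print(empirical_quality(8450302946, 131198570.97))   # own set from raw calls -> 18
```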

    Obs includes all the bases observed, both matches and mismatches. Err includes only the mismatched bases.

    Does this help clear things up?



  • Sheila Broad Institute, Member, Broadie admin
  • AlexanderV Berlin, Member

    @fleharty yes. That clears things up very much. Thank you!
    @Sheila indeed it is. Thank you!
