
# BQSR errors vs real variation

AlexanderV
Berlin · Member ✭

Suppose there is a known rate of 1 SNP every 30 base pairs, and my reads (counting only the 'good' ones used in BQSR) contain 1e10 base pairs in total. That gives `1e10 / 30 ≈ 3.3e8` SNPs on my reads.

Can I say that I expect BQSR to find `3.3e8 + epsilon` errors when not masking any known SNPs?
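For concreteness, here is that back-of-the-envelope arithmetic in a few lines of Python (the SNP rate and base-pair total are the figures assumed in my question, not measured values):

```python
# Back-of-the-envelope estimate using the figures above:
# an assumed rate of 1 SNP per 30 bp and 1e10 'good' base pairs.
snp_rate = 1 / 30                      # known SNPs per base pair
total_bases = 1e10                     # base pairs entering BQSR
expected_snps = total_bases * snp_rate
print(f"expected SNPs on reads: {expected_snps:.2e}")  # → 3.33e+08
```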

This is connected to creating my own set of known SNPs for BQSR. So far I have the feeling from the reports that the error rate is too high and my set should be bigger, using the known rate of SNPs in my data as a comparison to the Observations/Errors in the recalibration report and table. So with an optimal set, I would get `3.3e8 + epsilon` Observations and `epsilon` Errors.

Am I right?


## Comments

@AlexanderV

Hi,

I think what you are saying is correct, but can you post an example output of the errors? It is true that BQSR considers all mismatches to the reference as errors, and the sites you input as known sites are masked and not counted as errors.

-Sheila

Hi @Sheila,

I missed your answer. Now I'm coming back to it.

I get the following Obs and Err values:

The first is BQSR with a minimal set (referring to this [1] post and your answer).

The second comes from raw calling and filtering.

I'm still twisting my head around this: the probability of an error decreases when `num mismatches` decreases and/or `num observations` increases. In my case nothing is increasing and the error (probability) is decreasing. Therefore, `Obs` from the table must correspond to `num mismatches` and not to `num observations`. If that is true, you should change the documentation.

The first set contains 70,326 variants (SNP only)

The second set contains 1,159,988 variants (overlap of 69,986 from the first set)

Something weird is that the overlap should be 70,326, because I combined the first set into the second (using CombineVariants on both sets with `filteredrecordsmergetype=KEEP_UNCONDITIONAL` and `minimalVCF=true`). Where are the other 340 variants? Is there an explanation? I'm counting with `zgrep -vc '^#'`.

[1] http://gatkforums.broadinstitute.org/discussion/1243/what-are-the-standard-resources-for-non-human-genomes#Comment_23336

*Issue created on GitHub by Sheila*

@AlexanderV I'm a bit surprised that you have more errors than observations. Would you be able to send us a copy of your BQSR table file?

We do assume that reference mismatches are errors, and p(error) = num mismatches / num observations, so I think the documentation is correct in that part.

@fleharty: There are not more errors. You might have read it wrong due to the punctuation. The errors have 2 digits after the decimal point.

Anyways: Here are the recal files.

first/second are the runs of BaseRecalibrator.

solcap is the minimal set, iter_1 is my created set.

What I mean about the documentation is that the formula has the variables "p(error)", "num mismatches" and "num observations". In the text you mention "reference mismatches we see", which seems to correspond to "num observations". The report only gives "Obs" and "Err", so it is not clear to me what "num mismatches" is here.

@AlexanderV Ah yes, thanks, you were right about the 2 digits after the decimal point.

The number of mismatches is closely related to the errors field in the recal table files.

If we consider line 124 from your file `BQSR_plate3.noChim.plate3.iter_1.first.table`, we see:

`LB_1 6 M 8.0000 2079576 345581.61`

There were 345581.61 errors and 2079576 observations, so in this case p(error) = 345581.61 / 2079576 = 0.1661789.

If we compute round(-log10(0.1661789) * 10) we get 8, which is the EmpiricalQuality.
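The same computation, sketched in Python (the Obs and Err values are taken from the recal table row quoted above):

```python
import math

# Obs and Err from the recal table row: LB_1 6 M 8.0000 2079576 345581.61
observations = 2079576      # all observed bases, matches and mismatches
errors = 345581.61          # mismatched bases only (fractional after BQSR weighting)

p_error = errors / observations                  # fraction of mismatched bases
empirical_q = round(-10 * math.log10(p_error))   # Phred-scaled quality
print(p_error, empirical_q)                      # p(error) ≈ 0.1662, EmpiricalQuality = 8
```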

Obs includes all the bases observed, both matches and mismatches. Err includes only the mismatched bases.

Does this help clear things up?

Thanks,

Mark

@AlexanderV

Hi,

This thread may also be of interest to you: http://gatkforums.broadinstitute.org/discussion/5046/how-to-compare-different-bqsr-runs

-Sheila

@fleharty yes. That clears things up very much. Thank you!

@Sheila indeed it is. Thank you!