
# BQSR errors vs real variation

Berlin · Member
edited August 2015

Suppose I have a known rate of 1 SNP every 30 base pairs, and my reads (counting only the 'good' ones used in BQSR) contain 1e10 base pairs in total. So I have `1e10 / 30 = 3.3e8` SNPs on my reads.
Can I say that I expect BQSR to find `3.3e8 + epsilon` errors when not masking any known SNPs?
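As a sanity check, the arithmetic works out like this (a quick sketch; the variable names are mine, the numbers are from above):

```python
# Expected number of reference mismatches contributed by true SNPs
# when no known sites are masked during BQSR.
total_bases = 1e10        # total 'good' base pairs used in BQSR
snp_rate = 1 / 30         # known rate: 1 SNP every 30 base pairs

expected_snp_mismatches = total_bases * snp_rate
print(f"{expected_snp_mismatches:.2e}")  # 3.33e+08
```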

This is connected to creating my own set of known SNPs for BQSR. So far, my feeling from the reports is that the error rate is too high and that my set should be bigger. I am using the known rate of SNPs in my data as a comparison to the Observations/Errors in the recalibration report and table. With the optimal set, I would get `3.3e8 + epsilon` Observations and `epsilon` Errors.
Am I right?


@AlexanderV
Hi,

I think what you are saying is correct, but can you post an example output of the errors? It is true that BQSR considers all mismatches to the reference as errors, and the sites you input as known sites are masked and not counted as errors.

-Sheila

• Berlin · Member

Hi @Sheila,

I get the following Obs and Err values

```
Obs             Err
10,160,023,994  281,306,385.20  minimal
 8,450,302,946  131,198,570.97  own from raw call
```

The first is BQSR with a minimal set (referring to this [1] post and your answer).
The second comes from raw calling and filtering.

I'm still twisting my head around this:

> We assume that all reference mismatches we see are therefore errors [...]
> [...] empirical probability of error [...], where p(error) = num mismatches / num observations.

The probability of an error decreases when num mismatches decreases and/or num observations increases.
In my case nothing is increasing and the error probability is decreasing. Therefore, `Obs` from the table must correspond to num mismatches, not num observations.
If that is true, you should change the documentation.

The first set contains 70,326 variants (SNPs only).
The second set contains 1,159,988 variants (an overlap of 69,986 with the first set).
Something weird: the overlap should be 70,326, because I combined the first set into it (using CombineVariants on both sets with filteredrecordsmergetype=KEEP_UNCONDITIONAL and minimalVCF=true). Where are the other 340 variants? Is there an explanation? I'm counting with `zgrep -vc '^#'`.
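A quick way to track down which sites from the first set are missing from the overlap is to compare (CHROM, POS) pairs directly (a sketch; the filenames are placeholders, not my actual files):

```python
# Sketch: find known sites from the first set that are absent from the
# combined set, by comparing (CHROM, POS) pairs.
import gzip

def vcf_sites(lines):
    """Return the set of (CHROM, POS) pairs from the body lines of a VCF."""
    return {tuple(line.split("\t")[:2]) for line in lines if not line.startswith("#")}

def sites_from_file(path):
    with gzip.open(path, "rt") as fh:
        return vcf_sites(fh)

# missing = sites_from_file("first_set.vcf.gz") - sites_from_file("combined.vcf.gz")
# print(len(missing))
```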

#### Issue · Github August 2015 by Sheila

Issue #149, state: closed, closed by fleharty

@AlexanderV I'm a bit surprised that you have more errors than observations. Would you be able to send us a copy of your BQSR table file?

We do assume that reference mismatches are errors, and p(error) = num mismatches / num observations, so I think the documentation is correct in that part.

• Berlin · Member

@fleharty: There are not more errors. You might have read it wrong due to the punctuation: the errors have 2 digits after the decimal point.
Anyway, here are the recal files.
`first`/`second` are the runs of BaseRecalibrator.
`solcap` is the minimal set; `iter_1` is my created set.

What I mean about the documentation is that the formula has the variables "p(error)", "num mismatches", and "num observations". In the text you mention "reference mismatches we see", which seems to correspond to "num observations". The report only gives "Obs" and "Err", so it is not clear to me what "num mismatches" is here.

@AlexanderV Ah yes, thanks, you were right about the 2 digits after the decimal point.

The number of mismatches is closely related to the errors field in the recal table files.

If we consider line 124 from your file `BQSR_plate3.noChim.plate3.iter_1.first.table`, we see:

```
LB_1 6 M 8.0000 2079576 345581.61
```

There were 345581.61 errors and 2079576 observations, so p(error) = 345581.61 / 2079576 = 0.1661789.
If we compute round(-log10(0.1661789) * 10) we get 8, which is the EmpiricalQuality.

Obs includes all the bases observed, both matches and mismatches. Err includes only the mismatched bases.
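In Python terms, the calculation from that table line can be reproduced directly (numbers taken from the line quoted above):

```python
# Reproduce the EmpiricalQuality calculation from the recal table line:
# LB_1 6 M 8.0000 2079576 345581.61
import math

observations = 2079576   # Obs: all bases observed (matches + mismatches)
errors = 345581.61       # Err: mismatched bases only

p_error = errors / observations
empirical_quality = round(-10 * math.log10(p_error))
print(round(p_error, 7), empirical_quality)  # 0.1661789 8
```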

Does this help clear things up?

Thanks,

Mark