Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

BQSR errors vs real variation

AlexanderVAlexanderV BerlinMember
edited August 2015 in Ask the GATK team

When I have a known rate of 1 SNP every 30 basepairs and my reads (only counting the 'good' ones, used in BQSR) have 1e10 basepairs in total. So, I have 1e10 / 30 = 3.3e8 SNPs on my reads.
Can I say that I expect BQSR to find 3.3e8 + epsilon errors when not masking any known SNPs?

This is connected to creating an own set of known SNPs for BQSR. I so far have the feeling from the reports, that the error rate is too high and my set should be bigger. Using the known rate of SNPs in my data as comparison to the Observations/Errors in the report and table from recalibration. So having the optimal set, I would get 3.3e8 + epsilon Observations and epsilon Errors.
Am I right?

Tagged:

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @AlexanderV
    Hi,

    I think what you are saying is correct, but can you post an example output of the errors? It is true BQSR considers all mismatches to the reference as errors. And, the sites you input as known sites are masked and not considered as errors.

    -Sheila

  • AlexanderVAlexanderV BerlinMember

    Hi @Sheila,

    I missed your answer. Now I'm coming back to it.

    I get the following Obs and Err values

    Obs             Err
    10,160,023,994  281,306,385.20  minimal
     8,450,302,946  131,198,570.97  own from raw call
    

    The first is BQSR with a minimal set (referring to this [1] post and your answer).
    The second comes from raw calling and filtering.

    I'm still twisting my head around this:

    We assume that all reference mismatches we see are therefore errors [...]
    [...] empirical probability of error [...], where p(error) = num mismatches / num observations.

    The probability of an error decreases when num mismatches decreases and/or num observations increases.
    In my case nothing is increasing and the error (prob.) is decreasing. Therefore, Obs from the table must correspond to num mismatches and not num observations.
    If that is true, you should change the documentation ;)

    The first set contains 70,326 variants (SNP only)
    The second set contains 1,159,988 variants (overlap of 69,986 from the first set)
    Something weird is, that the overlap should be 70,326, because I combined the first set (using CombineVariants on both sets with filteredrecordsmergetype=KEEP_UNCONDITIONAL and minimalVCF=true). Where are the other 340 variants? Is there an explanation? I'm counting with zgrep -vc '^#'.

    [1] http://gatkforums.broadinstitute.org/discussion/1243/what-are-the-standard-resources-for-non-human-genomes#Comment_23336

    Issue · Github
    by Sheila

    Issue Number
    149
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    fleharty
  • flehartyfleharty Member, Broadie, Dev ✭✭

    @AlexanderV I'm a bit surprised that you have more errors than observations. Would you be able to send us a copy of your BQSR table file?

    We do assume that reference mismatches are errors, and p(error) = num mismatches / num observations, so I think the documentation is correct in that part.

  • AlexanderVAlexanderV BerlinMember

    @fleharty: There are not more errors. You might have read it wrong due to the punctuation. The errors have 2 digits after the decimal point.
    Anyways: Here are the recal files.
    first/second are the runs of BaseRecalibrator.
    solcap is the minimal set, iter_1 is my created set.

    What I mean about the documentation is, that in the formula you have the variables "p(error)", "num mismatches" and "num observations". In the text you mention "reference mismatches we see" which seems to correspond to "num observations". The report only gives "Obs" and "Err". So, It is not clear to me what "num mismatches" is here.

  • flehartyfleharty Member, Broadie, Dev ✭✭

    @AlexanderV Ah yes, thanks, you were right about the 2 digits after the decimal point.

    The number of mismatches is closely related to the errors field in the recal table files.

    If we consider line 124 from your file BQSR_plate3.noChim.plate3.iter_1.first.table
    we see,
    LB_1 6 M 8.0000 2079576 345581.61

    There were 345581.61 errors, and 2079576 observations, In this case p(error) = 345581.61/2079576 = 0.1661789
    if we compute round(-log10(0.1661789)*10) we get 8 which is the EmpiricalQuality.

    Obs includes all the bases observed, both matches and mismatches. Err includes only the mismatched bases.

    Does this help clear things up?

    Thanks,

    Mark

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
  • AlexanderVAlexanderV BerlinMember

    @fleharty yes. That clears things up very much. Thank you!
    @Sheila indeed it is. Thank you!

Sign In or Register to comment.