
How to compare different BQSR runs?

I am trying BQSR with different known SNP files. How do I know if one BQSR run is better than another?

Is RMSE the right criterion, i.e. the smaller the better?

I am calculating RMSE by summing Observations x Accuracy x Accuracy for one category (e.g. QualityScore for Base Substitution) in the csv generated by AnalyzeCovariates, then dividing by the total Observations and taking the square root. Is this the right way to calculate RMSE?

I noticed that different runs gave me different number of total Observations. Why is that?

What about the Errors column in the csv file? What does it mean? Can it be used to determine which BQSR run is better?

Thanks a lot in advance!


  • Sheila (Broad Institute), Member, Broadie, Moderator



    Sorry for the delayed response!

    You are correct that the smaller the RMSE, the better. However, I will need @rpoplin to jump in here and clarify what the calculation for RMSE is.

    The different totals of observations come from the different input known sites files. BQSR skips over the positions in the known sites files, so if the input files contain different numbers of variant sites, that will affect the number of observations.

    The Errors column in the csv file is the number of times the program observed a difference between a base call and the reference allele at that site. It is the number of observations the program considered wrong, which is what it feeds to the model.


  • ymc, Member

    Thanks for your reply. I found that the Errors values in my csv files are not integers but floating point numbers. Why is that?

  • Sheila (Broad Institute), Member, Broadie, Moderator


    Can you post an example record of the errors that are floating point numbers?


  • ymc, Member

    This is an excerpt of a csv file I get from AnalyzeCovariates. The sixth field is Errors:

    exome,AA,Context,Base Substitution,417693517,597145.30,28.00,27.52,0.48,Before
    exome,AAA,Context,Base Insertion,139023947,7910.32,42.00,45.00,-3.00,Before
    exome,AAA,Context,Base Deletion,139023947,20980.80,38.00,45.00,-7.00,Before
    exome,CA,Context,Base Substitution,426740636,421427.11,30.00,29.05,0.95,Before
    exome,CAA,Context,Base Insertion,91199705,1091.18,49.00,45.00,4.00,Before
    exome,CAA,Context,Base Deletion,91199705,1854.61,47.00,45.00,2.00,Before
    exome,GA,Context,Base Substitution,358651596,793344.53,27.00,26.29,0.71,Before
    exome,GAA,Context,Base Insertion,110204691,756.92,51.00,45.00,6.00,Before
    exome,GAA,Context,Base Deletion,110204691,1777.99,48.00,45.00,3.00,Before
    exome,TA,Context,Base Substitution,242735980,488847.19,27.00,26.55,0.45,Before
    exome,TAA,Context,Base Insertion,69770205,925.20,49.00,45.00,4.00,Before
    exome,TAA,Context,Base Deletion,69770205,1986.92,45.00,45.00,0.00,Before
    exome,AC,Context,Base Substitution,276176933,766355.49,26.00,25.53,0.47,Before
    exome,ACA,Context,Base Insertion,99406365,1049.58,49.00,45.00,4.00,Before
    exome,ACA,Context,Base Deletion,99406365,2017.76,47.00,45.00,2.00,Before
    exome,CC,Context,Base Substitution,377365986,896139.17,26.00,26.20,-0.20,Before
    exome,CCA,Context,Base Insertion,120755217,646.46,52.00,45.00,7.00,Before
    exome,CCA,Context,Base Deletion,120755217,1406.21,49.00,45.00,4.00,Before
    exome,GC,Context,Base Substitution,306898561,1329671.30,24.00,24.89,-0.89,Before
    exome,GCA,Context,Base Insertion,93622337,600.96,51.00,45.00,6.00,Before
    exome,GCA,Context,Base Deletion,93622337,1119.10,49.00,45.00,4.00,Before
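For reference, the rows above can be picked apart with a short script. This is only a sketch, with the column layout inferred from the excerpt itself rather than from any official GATK spec: the 5th field looks like Observations, the 6th Errors, the 7th and 8th the reported and empirical qualities, and the 9th (Accuracy) their difference.

```python
import csv
import io

# Three rows from the excerpt above. Field positions are an assumption
# inferred from the data, not an official GATK column specification.
EXCERPT = """\
exome,AA,Context,Base Substitution,417693517,597145.30,28.00,27.52,0.48,Before
exome,AAA,Context,Base Insertion,139023947,7910.32,42.00,45.00,-3.00,Before
exome,GC,Context,Base Substitution,306898561,1329671.30,24.00,24.89,-0.89,Before"""

checked = []
for row in csv.reader(io.StringIO(EXCERPT)):
    reported = float(row[6])   # 7th field: reported quality
    empirical = float(row[7])  # 8th field: empirical quality
    accuracy = float(row[8])   # 9th field: Accuracy
    # Accuracy appears to be the difference of the two quality columns
    # (e.g. 28.00 - 27.52 = 0.48), up to rounding in the file.
    assert abs((reported - empirical) - accuracy) < 0.005
    checked.append(accuracy)
```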

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    Sorry for the delay, we've just been really busy for the past few weeks. I think I may have misinformed @Sheila about what the file contains. She will check with the developers and correct my mistake as appropriate.

  • Sheila (Broad Institute), Member, Broadie, Moderator


    Thanks for being patient. I have a response from our developers.

    The errors are floating point numbers because we are fractionally distributing the errors when there are multiple possible alignments for the reads. This is done using the BAQ calculation from Heng Li.

    The goal is not to trust the actual alignment of the read too much. We showed that this procedure slightly improves the accuracy of the recalibration.


  • ymc, Member

    Thanks for your reply. I will take a look at the paper.

    So getting back to the original question: how to calculate RMSE from the csv file?

  • Sheila (Broad Institute), Member, Broadie, Moderator


    Sorry for the delayed response.

    RMSE = sqrt(sum((QualityScore-EmpiricalScore)^2 * numberOfObservations) / sum(Observations))


  • ymc, Member

    Thanks for your reply.

    If the formula is put in terms of the names of the title fields in the csv file, then it becomes

    RMSE = sqrt(sum((AverageReportedQuality - EmpiricalQuality)^2 * Observations) / sum(Observations))

    Since the 9th field, Accuracy, is the same as AverageReportedQuality - EmpiricalQuality,

    RMSE = sqrt(sum(Accuracy * Accuracy * Observations) / sum(Observations))

    This is the formula I used in the first post. So I suppose you guys and I are using the same definition of RMSE. Good.
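    In code form, the calculation looks roughly like the sketch below. The field positions are my assumption, taken from the excerpt earlier in the thread (3rd field CovariateName, 4th EventType, 5th Observations, 9th Accuracy), not an official GATK column spec.

```python
import csv
import io
import math

def rmse(rows, covariate_name, event_type):
    """RMSE = sqrt(sum(Accuracy^2 * Observations) / sum(Observations))
    over the rows matching one (CovariateName, EventType) pair."""
    weighted_sq_err = 0.0
    total_obs = 0
    for row in rows:
        if row[2] == covariate_name and row[3] == event_type:
            obs = int(row[4])       # 5th field: Observations
            acc = float(row[8])     # 9th field: Accuracy
            weighted_sq_err += acc * acc * obs
            total_obs += obs
    return math.sqrt(weighted_sq_err / total_obs)

# Two Base Substitution rows from the excerpt earlier in the thread.
CSV_TEXT = """\
exome,AA,Context,Base Substitution,417693517,597145.30,28.00,27.52,0.48,Before
exome,CA,Context,Base Substitution,426740636,421427.11,30.00,29.05,0.95,Before"""
value = rmse(list(csv.reader(io.StringIO(CSV_TEXT))), "Context", "Base Substitution")
```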

  • ymc, Member

    Another related question:

    So I am getting nine RMSEs for each run: a 3x3 grid of (Quality, Context, Cycle) x (Base Substitution, Base Insertion, Base Deletion).

    I think the purpose of BQSR is to replace the original quality scores with better ones. Does that mean the RMSEs for Quality are more important than the RMSEs for Context and Cycle?

    Even within the RMSEs for Quality, is it true that Subs Quality > Ins Quality > Del Quality in terms of importance, because deletions have no quality score from the machine?
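    For reference, one way to get all nine numbers is to accumulate the two sums per (CovariateName, EventType) pair. Again this is only a sketch, with the column positions assumed from the excerpt earlier in the thread.

```python
import csv
import io
import math
from collections import defaultdict

def rmse_table(rows):
    """Map (CovariateName, EventType) -> RMSE, accumulating
    sum(Accuracy^2 * Observations) and sum(Observations) per pair."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [weighted sq error, observations]
    for row in rows:
        key = (row[2], row[3])            # (CovariateName, EventType), assumed positions
        obs = int(row[4])                 # 5th field: Observations
        acc = float(row[8])               # 9th field: Accuracy
        sums[key][0] += acc * acc * obs
        sums[key][1] += obs
    return {k: math.sqrt(sq / obs) for k, (sq, obs) in sums.items()}

# Two rows from the excerpt earlier in the thread.
CSV_TEXT = """\
exome,AAA,Context,Base Insertion,139023947,7910.32,42.00,45.00,-3.00,Before
exome,AAA,Context,Base Deletion,139023947,20980.80,38.00,45.00,-7.00,Before"""
table = rmse_table(list(csv.reader(io.StringIO(CSV_TEXT))))
```

With a single row per pair, the RMSE collapses to |Accuracy|, so the two entries here come out as 3.0 and 7.0.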

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    No, all covariates are treated equally. We want to minimize the errors on all three.

    Similarly, we do not prioritize the quality by type of variant.
