We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Parallelizing base quality score recalibration

I'm attempting to use the BaseRecalibrator tool for 30-50x depth whole genome datasets with BAM files of around 100 - 150GB. However it is very computationally demanding so I'd really like to distribute the processing over many cores on our cluster. I've done this for the indel realignment process by running for each chromosome separately as described in the now retired guidelines on "Parallelism with the GATK" (I think a new version is due to be issued at some point). It's less clear, to me at least, how to do this for the BaseRecalibrator.

For example, is it possible to combine GATKReports for the recalibration data generated for separate chromosomes? Or should I run the on-the-fly recalibration with PrintReads and the -BQSR option using the recalibration data for each chromosome separately? If the latter, does it matter that for some of the smaller unplaced/unlocalized chromosomes the recalibration tables will contain values for covariates generated with only a few observations? The documentation on the Base Quality Score Recalibrator seems to suggest that the recalibration tables need to be calculated over the whole genome.


Best Answers


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Matt,

    BaseRecalibrator is harder to parallelize because as you correctly read in the documentation, it works much better if it has access to the whole genome data. The recalibration model depends on having a lot of observations; if you reduce the number of observations you can dramatically reduce the effectiveness of the tool.

    The new parallelism document will include our best recommendations for parallelizing all of our major tools including BaseRecalibrator. As for its ETA, it is indeed due to come out soon; it got a little bit delayed because we've been working hard on preparing the workshop (which we're in the middle of right now), but it'll be out next week for sure.

  • mdemde Member

    Many thanks Geraldine, I'll look out for the new parallelism guide.

  • mdemde Member

    Hi Geraldine,

    I've been looking at the new parallelism primer and accompanying guide on specific usage recommendations for various GATK tools. From these guides and another response by Eric Banks to a similar post (http://gatkforums.broadinstitute.org/discussion/1458/countcovariates-timings), it looks like it should be possible to run a scatter-gather approach and I'd be interested in any further suggestions for how to go about this.

    For example, can I divide the reference genome into a number of segments and run the BaseRecalibrator separately on each one? Would I then apply the calibration table using PrintReads to that segment or would I be able to somehow combine the calibration data into a single calibration table to apply to the whole genome?

    To avoid the problem of only having a few observations for the smaller unplaced/unlocalized contigs, should I instead divide the genome into roughly equal-size segments and would it matter if those spanned across chromosome boundaries?

    Or have I got it completely wrong and there is another way of applying the scatter-gather approach that doesn't involve chunking the reference genome?

    Thanks in advance,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Matt,

    With Queue, the scattering is decided for you with the @PartitionBy(PartitionType.READ) annotation which says that each read can be processed individually. Without Queue though, since BaseRecalibrator is a read walker, it can only be scattered by contig so that reads aren't counted more than once.

  • mdemde Member

    Many thanks Geraldine and Mauricio, this is very helpful.

    As you may have gathered, the reason I'm asking is because we're using another workflow management system. It looks like the source code for the relevant classes (BQSRGatherer, RecalUtils, etc.) are available from the GitHub site so I'll take a look and see if I can work out how to get this defined as a task in our workflow system. Of course, any help or suggestions on how to run this standalone (separate from Queue) would be gratefully received.


  • egafniegafni Member

    Matt (or anyone else), were you able to get BQSRGatherer working with your workflow management system? We're in the same boat and any tips would be appreciated.


  • mdemde Member

    Hi egafni,
    I'm afraid other priorities have got in the way of making any further progress with this but I still intend to return to it when I get the chance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for sharing your solution, @mmoisse.

  • srivas31srivas31 Member

    Is the recalibrated base quality in BQSR bam files could only be read by GATK tools?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @srivas31,

    I'm not sure exactly what you are asking. Once you complete the recalibration process, the recalibrated base qualities replace the original qualities, and are readable by any program that reads valid BAM files.

  • geertvandeweyergeertvandeweyer Member
    edited April 2013


    I'm testing the solution of mmoisse for scatter-gathering the BQSR procedure, following these steps:

    • Run for Gene Panel targets on each chromosome separately, for all samples on a MiSeq run together (multiple -I params) : -T baseRecalibrator -nct 8 -I Sample1.bam -I Sample2.bam -I SampleXX(1-15).bam -o recal.chrX.grp -L Targets.chrX.bed

    • Join the grp files using the java snippet above.

    However, I get different values compared to running the baseRecalibrator on the full panel.bed file (all chromosomes). Is it normal to have different number of observations after scatter-gather compared to the full run? I have added a small subset of values below:


    ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors

    VSD-550-1_S9.L001.1 M 27.0000 29.0518 226637266 421742.43

    VSD-550-1_S9.L001.1 I 44.0000 45.0000 226637266 10075.27

    VSD-550-1_S9.L001.1 D 38.0000 45.0000 226637266 37670.18

    VDP-480-1_S3.L001.1 M 27.0000 29.0699 252989865 459948.36

    VDP-480-1_S3.L001.1 I 47.0000 45.0000 252989865 4789.00

    VDP-480-1_S3.L001.1 D 38.0000 45.0000 252989865 41935.60

    => Per Chr BQSR + JOIN GRP

    ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors

    VSD-550-1_S9.L001.1 M 27.0000 29.1196 245566293 450173.53

    VSD-550-1_S9.L001.1 I 46.0000 45.0000 245566293 6663.20

    VSD-550-1_S9.L001.1 D 38.0000 45.0000 245566293 43416.70

    VDP-480-1_S3.L001.1 M 27.0000 29.1318 241318168 440363.72

    VDP-480-1_S3.L001.1 I 45.0000 45.0000 241318168 8093.53

    VDP-480-1_S3.L001.1 D 38.0000 45.0000 241318168 42638.13

    There are 15 samples, each in a readgroup in the test panel, For some RGs, i see more observations in the full vs joined run, for other readgroups, it's the other way around.

    I use GATK 2.4.9-g532efad

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Geert,

    Unfortunately I can't explain these big differences in the number of observations. We'd expect some differences because of downsampling, but the numbers should be closer than what you're seeing. We'll do a couple of tests internally to see if we can reproduce this behavior. I'll keep you posted on what we find.

  • Hi,

    Thanks for the reply. If this is indeed related to downsampling, it might be important to mention that the average base coverage is about 1200x for our dataset.

    I will rerun the analysis (split+ join vs full BQSR) without downsampling (dt=NONE and dfrac=1) and report the results here.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, let me know how it goes. FYI we've run some checks on our end using Queue and we don't see any differences in the observation counts. Maybe some are getting lost when you join the grp tables?

  • These are the results for running BQSR with "-dt NONE' option. The difference seems to be smaller, but it's still there.

    Should the java snippet from above run the exact same merging procedure as Queue? Or can I provide extra arguments somehow?

    => JOINED

    VSD-550-1_S9.L001.1 M 27.0000 29.0632 225951155 418512.76

    VSD-550-1_S9.L001.1 I 43.0000 45.0000 225951155 12704.96

    VSD-550-1_S9.L001.1 D 38.0000 45.0000 225951155 36649.58

    VDP-480-1_S3.L001.1 M 27.0000 29.1282 236898453 444426.25

    VDP-480-1_S3.L001.1 I 46.0000 45.0000 236898453 5473.07

    VDP-480-1_S3.L001.1 D 37.0000 45.0000 236898453 44294.48

    => FULL

    VSD-550-1_S9.L001.1 M 27.0000 29.0518 226636962 421742.43

    VSD-550-1_S9.L001.1 I 44.0000 45.0000 226636962 10075.27

    VSD-550-1_S9.L001.1 D 38.0000 45.0000 226636962 37670.18

    VDP-480-1_S3.L001.1 M 27.0000 29.0699 252989713 459948.70

    VDP-480-1_S3.L001.1 I 47.0000 45.0000 252989713 4789.00

    VDP-480-1_S3.L001.1 D 38.0000 45.0000 252989713 41935.60

  • egafniegafni Member

    I'm also getting some quirky behavior using @mmoisse's script
    If you ask it to combine a single recal file, it produces a "combined" recal file with significant differences!

    Has anyone identified a solution to this issue?

  • egafniegafni Member

    I updated my version of GATK, and things got better for the most part.

    when a diff a file that I run gather on, the output is still non-identical, at least on a tiny dataset. The difference is 1 integer in 1 row (column is empirical quality).

  • mdemde Member

    I finally got around to using BQSRGatherer to merge recalibration tables and am happy to say that it is working for me. This is using a different workflow management system, not GATK-Queue. I'm running GATK 2.4.9 on human data sets using the version 2.5 resource bundle (hg19). following the recommendations for known sites in http://www.broadinstitute.org/gatk/guide/article?id=1247, and only get very minor differences (second decimal place) in values in the Errors columns.

    Many thanks to Mauricio Carneiro and mmoisse for pointing me in the right direction.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Good to hear, thanks for reporting back on your results. WIll pass on your thanks to @Carneiro, who is always happy to be helpful :)

Sign In or Register to comment.