The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Did we ask for a bug report?

Then follow instructions in Article#1894.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

##### Jump to another community
Picard 2.9.4 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Parallelizing base quality score recalibration

Member

I'm attempting to use the BaseRecalibrator tool for 30-50x depth whole genome datasets with BAM files of around 100 - 150GB. However it is very computationally demanding so I'd really like to distribute the processing over many cores on our cluster. I've done this for the indel realignment process by running for each chromosome separately as described in the now retired guidelines on "Parallelism with the GATK" (I think a new version is due to be issued at some point). It's less clear, to me at least, how to do this for the BaseRecalibrator.

For example, is it possible to combine GATKReports for the recalibration data generated for separate chromosomes? Or should I run the on-the-fly recalibration with PrintReads and the -BQSR option using the recalibration data for each chromosome separately? If the latter, does it matter that for some of the smaller unplaced/unlocalized chromosomes the recalibration tables will contain values for covariates generated with only a few observations? The documentation on the Base Quality Score Recalibrator seems to suggest that the recalibration tables need to be calculated over the whole genome.

Thanks,
Matt

Tagged:

## Answers

• Cambridge, MAMember, Administrator, Broadie

Hi Matt,

BaseRecalibrator is harder to parallelize because as you correctly read in the documentation, it works much better if it has access to the whole genome data. The recalibration model depends on having a lot of observations; if you reduce the number of observations you can dramatically reduce the effectiveness of the tool.

The new parallelism document will include our best recommendations for parallelizing all of our major tools including BaseRecalibrator. As for its ETA, it is indeed due to come out soon; it got a little bit delayed because we've been working hard on preparing the workshop (which we're in the middle of right now), but it'll be out next week for sure.

• Member

Many thanks Geraldine, I'll look out for the new parallelism guide.
Matt

• Member

Hi Geraldine,

I've been looking at the new parallelism primer and accompanying guide on specific usage recommendations for various GATK tools. From these guides and another response by Eric Banks to a similar post (http://gatkforums.broadinstitute.org/discussion/1458/countcovariates-timings), it looks like it should be possible to run a scatter-gather approach and I'd be interested in any further suggestions for how to go about this.

For example, can I divide the reference genome into a number of segments and run the BaseRecalibrator separately on each one? Would I then apply the calibration table using PrintReads to that segment or would I be able to somehow combine the calibration data into a single calibration table to apply to the whole genome?

To avoid the problem of only having a few observations for the smaller unplaced/unlocalized contigs, should I instead divide the genome into roughly equal-size segments and would it matter if those spanned across chromosome boundaries?

Or have I got it completely wrong and there is another way of applying the scatter-gather approach that doesn't involve chunking the reference genome?

Thanks in advance,
Matt

• Cambridge, MAMember, Administrator, Broadie

Hi Matt,

With Queue, the scattering is decided for you with the @PartitionBy(PartitionType.READ) annotation which says that each read can be processed individually. Without Queue though, since BaseRecalibrator is a read walker, it can only be scattered by contig so that reads aren't counted more than once.

• Member

Many thanks Geraldine and Mauricio, this is very helpful.

As you may have gathered, the reason I'm asking is because we're using another workflow management system. It looks like the source code for the relevant classes (BQSRGatherer, RecalUtils, etc.) are available from the GitHub site so I'll take a look and see if I can work out how to get this defined as a task in our workflow system. Of course, any help or suggestions on how to run this standalone (separate from Queue) would be gratefully received.

Matt

• Member

Matt (or anyone else), were you able to get BQSRGatherer working with your workflow management system? We're in the same boat and any tips would be appreciated.

Thanks!

• Member

Hi egafni,
I'm afraid other priorities have got in the way of making any further progress with this but I still intend to return to it when I get the chance.
Matt

• Cambridge, MAMember, Administrator, Broadie

Thanks for sharing your solution, @mmoisse.

• Member

Is the recalibrated base quality in BQSR bam files could only be read by GATK tools?

• Cambridge, MAMember, Administrator, Broadie

Hi @srivas31,

I'm not sure exactly what you are asking. Once you complete the recalibration process, the recalibrated base qualities replace the original qualities, and are readable by any program that reads valid BAM files.

• Member
edited April 2013

Hi,

I'm testing the solution of mmoisse for scatter-gathering the BQSR procedure, following these steps:

• Run for Gene Panel targets on each chromosome separately, for all samples on a MiSeq run together (multiple -I params) : -T baseRecalibrator -nct 8 -I Sample1.bam -I Sample2.bam -I SampleXX(1-15).bam -o recal.chrX.grp -L Targets.chrX.bed

• Join the grp files using the java snippet above.

However, I get different values compared to running the baseRecalibrator on the full panel.bed file (all chromosomes). Is it normal to have different number of observations after scatter-gather compared to the full run? I have added a small subset of values below:

=> FULL TARGET BED FILE

ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors

VSD-550-1_S9.L001.1 M 27.0000 29.0518 226637266 421742.43

VSD-550-1_S9.L001.1 I 44.0000 45.0000 226637266 10075.27

VSD-550-1_S9.L001.1 D 38.0000 45.0000 226637266 37670.18

VDP-480-1_S3.L001.1 M 27.0000 29.0699 252989865 459948.36

VDP-480-1_S3.L001.1 I 47.0000 45.0000 252989865 4789.00

VDP-480-1_S3.L001.1 D 38.0000 45.0000 252989865 41935.60

=> Per Chr BQSR + JOIN GRP

ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors

VSD-550-1_S9.L001.1 M 27.0000 29.1196 245566293 450173.53

VSD-550-1_S9.L001.1 I 46.0000 45.0000 245566293 6663.20

VSD-550-1_S9.L001.1 D 38.0000 45.0000 245566293 43416.70

VDP-480-1_S3.L001.1 M 27.0000 29.1318 241318168 440363.72

VDP-480-1_S3.L001.1 I 45.0000 45.0000 241318168 8093.53

VDP-480-1_S3.L001.1 D 38.0000 45.0000 241318168 42638.13

There are 15 samples, each in a readgroup in the test panel, For some RGs, i see more observations in the full vs joined run, for other readgroups, it's the other way around.

I use GATK 2.4.9-g532efad

• Cambridge, MAMember, Administrator, Broadie

Hi Geert,

Unfortunately I can't explain these big differences in the number of observations. We'd expect some differences because of downsampling, but the numbers should be closer than what you're seeing. We'll do a couple of tests internally to see if we can reproduce this behavior. I'll keep you posted on what we find.

• Member

Hi,

Thanks for the reply. If this is indeed related to downsampling, it might be important to mention that the average base coverage is about 1200x for our dataset.

I will rerun the analysis (split+ join vs full BQSR) without downsampling (dt=NONE and dfrac=1) and report the results here.

best,
geert

• Cambridge, MAMember, Administrator, Broadie

OK, let me know how it goes. FYI we've run some checks on our end using Queue and we don't see any differences in the observation counts. Maybe some are getting lost when you join the grp tables?

• Member

These are the results for running BQSR with "-dt NONE' option. The difference seems to be smaller, but it's still there.

Should the java snippet from above run the exact same merging procedure as Queue? Or can I provide extra arguments somehow?

=> JOINED

VSD-550-1_S9.L001.1 M 27.0000 29.0632 225951155 418512.76

VSD-550-1_S9.L001.1 I 43.0000 45.0000 225951155 12704.96

VSD-550-1_S9.L001.1 D 38.0000 45.0000 225951155 36649.58

VDP-480-1_S3.L001.1 M 27.0000 29.1282 236898453 444426.25

VDP-480-1_S3.L001.1 I 46.0000 45.0000 236898453 5473.07

VDP-480-1_S3.L001.1 D 37.0000 45.0000 236898453 44294.48

=> FULL

VSD-550-1_S9.L001.1 M 27.0000 29.0518 226636962 421742.43

VSD-550-1_S9.L001.1 I 44.0000 45.0000 226636962 10075.27

VSD-550-1_S9.L001.1 D 38.0000 45.0000 226636962 37670.18

VDP-480-1_S3.L001.1 M 27.0000 29.0699 252989713 459948.70

VDP-480-1_S3.L001.1 I 47.0000 45.0000 252989713 4789.00

VDP-480-1_S3.L001.1 D 38.0000 45.0000 252989713 41935.60

• Member

I'm also getting some quirky behavior using @mmoisse's script
If you ask it to combine a single recal file, it produces a "combined" recal file with significant differences!

Has anyone identified a solution to this issue?

• Member

I updated my version of GATK, and things got better for the most part.
http://gatkforums.broadinstitute.org/discussion/1560/notice-bug-affecting-bqsr-run-with-queue-scatter-gather

when a diff a file that I run gather on, the output is still non-identical, at least on a tiny dataset. The difference is 1 integer in 1 row (column is empirical quality).

• Member

I finally got around to using BQSRGatherer to merge recalibration tables and am happy to say that it is working for me. This is using a different workflow management system, not GATK-Queue. I'm running GATK 2.4.9 on human data sets using the version 2.5 resource bundle (hg19). following the recommendations for known sites in http://www.broadinstitute.org/gatk/guide/article?id=1247, and only get very minor differences (second decimal place) in values in the Errors columns.

Many thanks to Mauricio Carneiro and mmoisse for pointing me in the right direction.
Matt

• Cambridge, MAMember, Administrator, Broadie

Good to hear, thanks for reporting back on your results. WIll pass on your thanks to @Carneiro, who is always happy to be helpful

Sign In or Register to comment.