

BaseRecalibrator Out of Memory problem

I use GATK v2.5-2-gf57256b and ran into an out-of-memory problem when running BaseRecalibrator.

```
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
```

I tried assigning increasingly large amounts of memory to the program and reduced the downsampling to -dcov 40. The last attempt was with very large memory:

```
java -Xmx47g -jar /cc/apps/GATK/2.5.2/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -I ${line}real_calmd.bam \
    -R ${huref} \
    -knownSites kgp_vcf/ALL.wholeGenome_wo_wgs.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf \
    -cov ReadGroupCovariate \
    -cov CycleCovariate \
    -cov ContextCovariate \
    -cov QualityScoreCovariate \
    -o ${line}_recal.csv \
    -dcov 40 &> output${line}_qual_recal1
```

The program managed to run longer and longer as I increased the memory and decreased the coverage each time. The last run, with 47 GB and -dcov 40, ran for 90 minutes (with 6 days remaining) before crashing.

My BAM files are quite large (around 150 GB each). I previously did the recalibration with an older version of GATK using CountCovariates, and it worked fine for these big BAM files. Is there anything I can do to make BaseRecalibrator work on these files as well, since I would like to use the newer version of GATK for my whole pipeline?

Thanks, casch

Best Answer


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi there,

    A couple of questions -- Are you running this on multiple samples at the same time? Are the different lanes of data clearly identified as such in the read groups?

  • I run it on one sample (individual) at a time; the BAMs are large because they are high-coverage whole human genomes.
    All lanes are clearly identified in the read groups.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hmm, there is some overhead to feeding in all lanes for a sample at once, but it shouldn't be causing that much of an issue. Since you have deep sequencing, you could try splitting up the recalibration by chromosome; that would help reduce the memory issue.
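
A per-chromosome split like the one suggested above can be scripted as a simple loop. This is only a dry-run sketch: it prints the commands rather than executing them, and it reuses the jar path, the ${huref}/${line} placeholders, and the known-sites file from the original command, while assuming b37-style contig names (1-22, X, Y, MT).

```shell
#!/bin/sh
# Dry-run sketch: print one BaseRecalibrator command per chromosome.
# ${huref} and ${line} are the placeholders from the original post;
# contig names assume a b37-style reference.
KNOWN=kgp_vcf/ALL.wholeGenome_wo_wgs.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf

for CHR in $(seq 1 22) X Y MT; do
  echo java -Xmx8g -jar /cc/apps/GATK/2.5.2/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R "${huref}" \
    -I "${line}real_calmd.bam" \
    -knownSites "$KNOWN" \
    -L "$CHR" \
    -o "${line}_recal.${CHR}.csv"
done
```

Each per-chromosome job needs far less memory than a whole-genome run, and the pieces finish independently.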

  • I'm still struggling with the above problem. I have tried many different things and can't get it to work; BaseRecalibrator still gives the out-of-memory error. Is anybody else having the same problem with large BAM files?

    Here is what I found out. (For everything below I assigned 23 GB to Java on a node with 24 GB of RAM in total, downsampled to -dcov 40, and used just one individual.)

    I tried:

    1) Running BaseRecalibrator on chr 1 only (-L 1): it ran for 102 minutes and finished around 12 million base pairs (5%) before it got stuck and gave the error

    2) Running BaseRecalibrator on chr 1 only (-L 1) four times, each time omitting one of the four covariates: same result as 1)

    3) Running BaseRecalibrator on chr 22 only (-L 22), a total of 49 million base pairs: it ran for 90 minutes and finished 25 million base pairs (50%) before giving the out-of-memory message

    4) Running BaseRecalibrator on chr MT only (-L MT) - It worked perfectly within a few seconds

    5) Running BaseRecalibrator on chr Y only (-L Y) - It worked perfectly within a minute

    6) Running BaseRecalibrator on subsections of chr 22, adding 10 million base pairs each time:
    10 million (-L 22:1-10000000): worked perfectly in 33 seconds
    20 million (-L 22:1-20000000): worked perfectly in 6 minutes
    30 million (-L 22:1-30000000): got stuck again at 25 million (the same place as the whole chromosome)

    7) I also first extracted only chr 22 from the BAM into a separate BAM before feeding it to BaseRecalibrator, but it behaved exactly like the -L 22 option on the whole file

    I noticed that each time it gets stuck at the same place for quite a while before giving the error message (see the attached file of program output).

    I also attach the header of the input BAM, in case you can spot anything there that could cause the problem.

    Please help with some suggestions of what I can try further.

    Also, if this does not work in the end, I'll have to use the old CountCovariates (which did work). Is BaseRecalibrator much improved compared to CountCovariates?


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi there,

    I see in the headers you posted that your BAM contains several read groups. BaseRecalibrator (like CountCovariates) operates on each read group individually. While it normally shouldn't be a problem to run the tool on a multi-RG BAM file, it does add some overhead, so you may get better results (i.e. not get stuck) if you split the file into separate read groups for processing.
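
To split by read group as suggested, the first step is listing the read-group IDs from the BAM header. The sketch below demonstrates the text processing on an inline header snippet; with a real file you would pipe in `samtools view -H input.bam` instead (this assumes samtools is installed; the sample IDs shown are made up).

```shell
#!/bin/sh
# Sketch: extract read-group IDs from @RG header lines.
# The inline HEADER stands in for `samtools view -H input.bam`.
HEADER='@HD	VN:1.0	SO:coordinate
@RG	ID:lane1	SM:sample1	PL:ILLUMINA
@RG	ID:lane2	SM:sample1	PL:ILLUMINA'

printf '%s\n' "$HEADER" | awk -F'\t' '/^@RG/ {
  for (i = 2; i <= NF; i++)
    if ($i ~ /^ID:/) { sub(/^ID:/, "", $i); print $i }
}'
```

Each ID can then be pulled into its own BAM with `samtools view -b -r lane1 input.bam > lane1.bam` and recalibrated separately.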

    I'm not sure why BaseRecalibrator would fail where CountCovariates succeeds. BaseRecalibrator is very similar to CountCovariates, except it also builds a model of indel qualities in addition to single-base qualities.

  • Just for information: I also tried doing the recalibration on separate read groups (for separate samples and chromosomes), but it still did not work.

    In the end I figured out what the problem was: it was the VCF file. I was using a merged VCF file from the KGP project. When I replaced it with the dbsnp_137.b37.vcf file from the GATK bundle, everything worked perfectly. So it seems something was wrong with the KGP VCF file; although it worked in CountCovariates, it did not work in BaseRecalibrator.
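
Given that the root cause was a bad known-sites VCF, it can be worth sanity-checking the VCF up front. GATK of that era shipped a ValidateVariants walker for this; the sketch below only prints the command (a dry run), reusing the jar path and placeholders from the original post.

```shell
#!/bin/sh
# Dry-run sketch: validate a known-sites VCF before feeding it to
# BaseRecalibrator. Paths and ${huref} are placeholders from the thread.
REF="${huref}"
VCF=kgp_vcf/ALL.wholeGenome_wo_wgs.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf

CMD="java -jar /cc/apps/GATK/2.5.2/GenomeAnalysisTK.jar -T ValidateVariants -R $REF --variant $VCF"
echo "$CMD"   # print instead of executing
```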
