BaseRecalibrator Out of Memory problem

I use GATK v2.5-2-gf57256b and ran into an out-of-memory problem when running BaseRecalibrator.

** ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java**

I tried assigning increasingly large amounts of memory to the program and reduced the coverage to -dcov 40. The last attempt used a very large heap:

java -Xmx47g -jar /cc/apps/GATK/2.5.2/GenomeAnalysisTK.jar -T BaseRecalibrator -I ${line}real_calmd.bam -R ${huref} -knownSites kgp_vcf/ALL.wholeGenome_wo_wgs.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf -cov ReadGroupCovariate -cov CycleCovariate -cov ContextCovariate -cov QualityScoreCovariate -o ${line}_recal.csv -dcov 40 &>output${line}_qual_recal1

The program ran longer each time I increased the memory and decreased the coverage. The last run, with 47 GB and -dcov 40, ran for 90 min (with an estimated 6 days remaining) before crashing.

My BAM files are quite large (around 150 GB each). I previously did the recalibration with an older version of GATK using CountCovariates, and it worked fine for these big BAM files. Is there anything I can do to make BaseRecalibrator work on these files as well? I would like to use the newer version of GATK for my whole pipeline.

/Thanks, casch

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi there,

    A couple of questions -- Are you running this on multiple samples at the same time? Are the different lanes of data clearly identified as such in the read groups?

  • casch (Member)

    I run it on one sample (individual) at a time; the BAMs are large because they are high-coverage whole human genomes.
    All lanes are clearly identified in the read groups.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hmm, there is some overhead to feeding in all lanes for a sample at once, but it shouldn't be causing that much of an issue. Since you have deep sequencing, you could try splitting up recalibration by chromosome; that would help reduce memory use.
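
A minimal sketch of the per-chromosome split suggested above. It only prints one BaseRecalibrator command per contig rather than running anything. The function name `print_recal_commands`, the `HEAP` variable, and the per-contig output naming are illustrative assumptions; the GATK arguments are the ones already used in the original command. The thread does not say how the per-chromosome tables would be combined afterwards, and this sketch does not address that.

```shell
# Print (do not run) one BaseRecalibrator command per contig.
# Usage: print_recal_commands GATK_JAR INPUT_BAM REFERENCE KNOWN_SITES CONTIG...
# HEAP is a hypothetical environment variable; 23g mirrors the value used in the thread.
print_recal_commands() {
    local jar=$1 bam=$2 ref=$3 known=$4
    shift 4
    local contig
    for contig in "$@"; do
        # Output name is an assumption: <bam basename>_<contig>_recal.csv
        printf 'java -Xmx%s -jar %s -T BaseRecalibrator -I %s -R %s -knownSites %s -L %s -o %s\n' \
            "${HEAP:-23g}" "$jar" "$bam" "$ref" "$known" "$contig" \
            "${bam%.bam}_${contig}_recal.csv"
    done
}
```

For example, `print_recal_commands GenomeAnalysisTK.jar sample.bam ref.fasta known.vcf 21 22 MT` prints three commands that can be reviewed and then submitted as separate jobs.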

  • casch (Member)

    I'm still struggling with the above problem. I tried many different things and can't get it to work. BaseRecalibrator still gives the out-of-memory error. Is anybody else having the same problem with large BAM files?

    Here is what I found out. (For everything below I assigned 23 GB to Java on a node with 24 GB of RAM in total, downsampled with -dcov 40, and used just one individual.)

    I tried:

    1) Running BaseRecalibrator on chr 1 only (-L 1): It ran for 102 min and got through around 12 million basepairs (5%) before it got stuck and gave the error

    2) Running BaseRecalibrator on chr 1 only (-L 1) four times, each time omitting one of the four covariates: Gave the same result as 1)

    3) Running BaseRecalibrator on chr 22 only (-L 22), a total of 49 million basepairs: It ran for 90 min and got through 25 million basepairs (50%) before it gave the out-of-memory message

    4) Running BaseRecalibrator on chr MT only (-L MT) - It worked perfectly within a few seconds

    5) Running BaseRecalibrator on chr Y only (-L Y) - It worked perfectly within a minute

    6) Running BaseRecalibrator on chr 22 subsections, adding 10 million basepairs each time:
    10 million (-L 22:1-10000000) - worked perfectly in 33 seconds
    20 million (-L 22:1-20000000) - worked perfectly in 6 minutes
    30 million (-L 22:1-30000000) - again got stuck at 25 million (same place as the whole chr)

    7) I also first extracted only chr 22 from the BAM and made a separate BAM for it before reading it into GATK BaseRecalibrator. But it behaved exactly like the -L 22 option with the whole file

    I noticed that every time it gets stuck for quite a while at the same place near the end before it gives the error message (see attached file of program output).

    I have also attached the header of the input BAM in case you can spot anything in there that could cause a problem.

    Please help with some suggestions of what I can try next.

    Also if this does not work in the end - I'll have to use the old CountCovariates (which did work) - is BaseRecalibrator much improved compared to CountCovariates?
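
The growing-window test in step 6 can be scripted. The sketch below only generates the interval strings in the `contig:1-END` form used with -L above; the function name `growing_intervals` is an illustrative assumption, not part of GATK.

```shell
# Print intervals CONTIG:1-END for END = STEP, 2*STEP, ... up to MAX (inclusive).
# Usage: growing_intervals CONTIG STEP MAX
growing_intervals() {
    local contig=$1 step=$2 max=$3
    local end=$step
    while [ "$end" -le "$max" ]; do
        printf '%s:1-%s\n' "$contig" "$end"
        end=$((end + step))
    done
}
```

Calling `growing_intervals 22 10000000 30000000` prints the three intervals tried in step 6, one per line; each can be passed to -L in a separate run to narrow down where the tool stalls.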


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi there,

    I see in the headers you posted that your bam contains several read groups. The BaseRecalibrator (and CountCovariates) operates on each read group individually; while normally it shouldn't be a problem to run the tool on the multi-RG bam file, it does add some overhead, so maybe you will get better results (i.e. not get stuck) if you split up the file into separate read groups for processing.

    I'm not sure why BaseRecalibrator would fail where CountCovariates succeeds. BaseRecalibrator is very similar to CountCovariates, except it also builds a model of indel qualities in addition to single-base qualities.
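
A rough sketch of the read-group split suggested above, assuming samtools is installed. The helper `read_group_ids` is an illustrative name; it reads a SAM header on standard input and prints each @RG ID. The samtools calls in the comments use `view -H` (header only) and `view -b -r` (BAM output restricted to one read group); file names are placeholders.

```shell
# Read a SAM header on stdin and print one read-group ID per line.
read_group_ids() {
    awk -F'\t' '$1 == "@RG" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)   # strip the "ID:" prefix
    }'
}

# Example use (placeholders, not run here):
#   samtools view -H sample.bam | read_group_ids | while read -r rg; do
#       samtools view -b -r "$rg" sample.bam > "sample.${rg}.bam"
#   done
```

Each resulting per-read-group BAM could then be given to BaseRecalibrator on its own with -I.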

  • casch (Member)

    Just for information: I also tried running the recalibration on separate read groups (for separate samples and chromosomes), but it still did not work.

    In the end I figured out what the problem was: the VCF file. I was using a merged VCF file from the KGP project. When I replaced it with the dbsnp_137.b37.vcf file from the GATK bundle, it worked perfectly. So it seems something was wrong with the KGP VCF file, and although it worked in CountCovariates it did not work in BaseRecalibrator.
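
The thread does not identify what was wrong with the KGP file. As one hedged way to compare two known-sites files, the sketch below summarises an uncompressed VCF read from standard input: it counts records per contig and, separately, records whose ALT field is a symbolic allele such as `<DEL>` (the KGP file name mentions SVs, but whether those were the cause is an assumption). The function name `summarise_vcf` is illustrative.

```shell
# Read an uncompressed VCF on stdin and print, tab-separated and sorted:
#   <contig>      <number of records>
#   symbolic_alt  <number of records whose ALT starts with "<">
summarise_vcf() {
    awk -F'\t' '
        /^#/ { next }                          # skip meta and header lines
        { n[$1]++; if ($5 ~ /^</) sym++ }      # column 1 = CHROM, column 5 = ALT
        END {
            for (c in n) printf "%s\t%d\n", c, n[c]
            printf "symbolic_alt\t%d\n", sym
        }' | sort
}
```

Running `summarise_vcf < dbsnp_137.b37.vcf` and the same on the merged KGP file gives two small tables that can be compared side by side.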
