
BaseRecalibrator Out of Memory problem

casch Posts: 14 Member

I use GATK v2.5-2-gf57256b and ran into an out-of-memory problem when running the BaseRecalibrator.

** ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java**

I tried assigning increasingly large amounts of memory to the program and reduced the coverage to -dcov 40. The last attempt used a very large heap:

Command: java -Xmx47g -jar /cc/apps/GATK/2.5.2/GenomeAnalysisTK.jar -T BaseRecalibrator -I ${line}real_calmd.bam -R ${huref} -knownSites kgp_vcf/ALL.wholeGenome_wo_wgs.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf -cov ReadGroupCovariate -cov CycleCovariate -cov ContextCovariate -cov QualityScoreCovariate -o ${line}_recal.csv -dcov 40 &>output${line}_qual_recal1

The program ran longer and longer each time I increased the memory and decreased the coverage. The last run, with 47 GB and -dcov 40, ran for 90 min (with an estimated 6 days remaining) before crashing.

My BAM files are quite large (around 150 GB each). I previously did the recalibration with an older version of GATK using CountCovariates, and it worked fine for these big BAM files. Is there anything I can do to make BaseRecalibrator work on these files as well? I would like to use the newer version of GATK for my whole pipeline.

/Thanks, casch

Answers

  • Geraldine_VdAuwera Posts: 6,073 Administrator, GATK Developer

    Hi there,

    A couple of questions -- Are you running this on multiple samples at the same time? Are the different lanes of data clearly identified as such in the read groups?

    Geraldine Van der Auwera, PhD

  • casch Posts: 14 Member

    I run it on one sample (individual) at a time. The BAMs are large because they are high-coverage whole human genomes. All lanes are clearly identified in the read groups.

  • Geraldine_VdAuwera Posts: 6,073 Administrator, GATK Developer

    Hmm, there is some overhead to feeding in all lanes for a sample at once, but it shouldn't be causing that much of an issue. Since you have deep sequencing, you could try splitting up recalibration by chromosome; that would help reduce the memory issue.
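
As a rough illustration of the per-chromosome suggestion, the sketch below only prints one BaseRecalibrator command per contig; it does not run anything. It assumes b37-style contig names (1 to 22, X, Y, MT) and uses placeholder paths, and the heap size is just an example value. The flags are limited to ones that already appear in this thread. How the per-chromosome outputs would then be combined is not covered here.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one BaseRecalibrator command per chromosome.
# All paths below are placeholders, not taken from the original post.
GATK_JAR="GenomeAnalysisTK.jar"   # hypothetical path
REF="reference.fasta"             # hypothetical path
BAM="sample.bam"                  # hypothetical path
KNOWN="known_sites.vcf"           # hypothetical path

# Build the command line for a single contig.
recal_cmd() {
    local chrom="$1"
    echo "java -Xmx23g -jar ${GATK_JAR} -T BaseRecalibrator" \
         "-I ${BAM} -R ${REF} -knownSites ${KNOWN}" \
         "-L ${chrom} -o recal_chr${chrom}.csv"
}

# Assumed b37-style contig names: 1..22, X, Y, MT.
for chrom in $(seq 1 22) X Y MT; do
    recal_cmd "${chrom}"
done
```

Each printed line can be inspected first and then submitted as a separate job, so that a failure on one contig does not affect the others.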

    Geraldine Van der Auwera, PhD

  • casch Posts: 14 Member

    I'm still struggling with the above problem. I tried many different things and can't get it to work. The BaseRecalibrator still gives the out-of-memory error. Is anybody else having the same problem with large BAM files?

    Here is what I found out. (For everything below, I assigned 23 GB to Java on a node with 24 GB of RAM in total, downsampled to -dcov 40, and used just one individual.)

    I tried:

    1) Running BaseRecalibrator on chr 1 only (-L 1): It ran for 102 min and finished around 12 million basepairs (5%) before it got stuck and gave the error

    2) Running BaseRecalibrator on chr 1 only (-L 1) four times, each time omitting one of the four covariates: It gave the same result as 1)

    3) Running BaseRecalibrator on chr 22 only (-L 22), a total of 49 million basepairs: It ran for 90 min and finished 25 million basepairs (50%) before it gave the out-of-memory message

    4) Running BaseRecalibrator on chr MT only (-L MT) - It worked perfectly within a few seconds

    5) Running BaseRecalibrator on chr Y only (-L Y) - It worked perfectly within a minute

    6) Running BaseRecalibrator on chr 22 subsections, adding 10 million basepairs each time:
       10 million (-L 22:1-10000000): worked perfectly in 33 seconds
       20 million (-L 22:1-20000000): worked perfectly in 6 minutes
       30 million (-L 22:1-30000000): again got stuck at 25 million (same place as the whole chr)

    7) I also first extracted only chr 22 from the BAM into a separate BAM before reading it into GATK BaseRecalibrator, but it behaved exactly like the -L 22 option with the whole file
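
The interval tests in 6) amount to a manual search for the point where the run starts failing. A minimal sketch of automating that with a binary search follows. It assumes the failure is tied to a genomic position and not to cumulative memory use, which the thread does not establish. `run_fails` is a hypothetical stub standing in for an actual BaseRecalibrator run on `-L 22:START-END`, and `BAD_POS` is a made-up value chosen to mimic the 25 million mark.

```shell
#!/usr/bin/env bash
# Sketch: narrow down where a run over 22:1-END starts to fail.

BAD_POS=25000000   # made-up trouble spot, used only by the stub below

# Stub: succeeds (exit 0) when the interval START-END covers BAD_POS,
# i.e. when the pretend run "fails". Replace with a real run and a check
# of its exit status.
run_fails() {
    local start="$1" end="$2"
    [ "${start}" -le "${BAD_POS}" ] && [ "${end}" -ge "${BAD_POS}" ]
}

# Arguments: LO (1..LO is known to work), HI (1..HI is known to fail),
# RES (stop once HI - LO is no larger than this).
# Prints the final "LO HI" pair bracketing the first failing position.
find_first_failing_end() {
    local lo="$1" hi="$2" res="$3" mid
    while [ $((hi - lo)) -gt "${res}" ]; do
        mid=$(( (lo + hi) / 2 ))
        if run_fails 1 "${mid}"; then
            hi="${mid}"   # still fails: the problem lies at or before mid
        else
            lo="${mid}"   # works: the problem lies after mid
        fi
    done
    echo "${lo} ${hi}"
}

# Start from the bounds reported above: 1-20M worked, 1-30M failed.
find_first_failing_end 20000000 30000000 1000000
```

With a real run behind `run_fails`, each probe costs one BaseRecalibrator invocation, so a coarse resolution keeps the number of runs small.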

    I noticed that every time, it stays stuck for quite a while at the same place before it gives the error message (see the attached file of program output).

    I have also attached the header of the input BAM in case you can spot anything in there that could cause a problem.

    Please help with some suggestions of what I can try next.

    Also if this does not work in the end - I'll have to use the old CountCovariates (which did work) - is BaseRecalibrator much improved compared to CountCovariates?

    Thanks!

    Attachments: headerbam.txt (3K), output_qual_recal1t18_30.txt (18K)

  • Geraldine_VdAuwera Posts: 6,073 Administrator, GATK Developer

    Hi there,

    I see in the headers you posted that your bam contains several read groups. The BaseRecalibrator (and CountCovariates) operates on each read group individually; while normally it shouldn't be a problem to run the tool on the multi-RG bam file, it does add some overhead, so maybe you will get better results (i.e. not get stuck) if you split up the file into separate read groups for processing.
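
One possible way to do the read-group split is sketched below. It is an assumption that samtools is available; the thread does not name a tool. The helper names `rg_ids` and `split_cmds` and the file names are made up for illustration. The script parses `@RG` lines from header text and prints, without running, one `samtools view -r` command per read group.

```shell
#!/usr/bin/env bash
# Sketch: list read-group IDs from SAM/BAM header text and print (not run)
# one extraction command per read group. File names are placeholders.

# Read header text on stdin and print the ID of every @RG line.
rg_ids() {
    awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4) }'
}

# Read read-group IDs on stdin and print one samtools command per ID.
split_cmds() {
    local bam="$1" rg
    while read -r rg; do
        echo "samtools view -b -r ${rg} -o ${bam%.bam}.${rg}.bam ${bam}"
    done
}

# Demo on a made-up two-lane header. With a real file the header text
# would come from: samtools view -H sample.bam
printf '@HD\tVN:1.0\n@RG\tID:lane1\tSM:s1\n@RG\tID:lane2\tSM:s1\n' \
    | rg_ids | split_cmds sample.bam
```

Each resulting per-read-group BAM could then be fed to BaseRecalibrator on its own.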

    I'm not sure why BaseRecalibrator would fail where CountCovariates succeeds. BaseRecalibrator is very similar to CountCovariates, except it also builds a model of indel qualities in addition to single-base qualities.

    Geraldine Van der Auwera, PhD

  • casch Posts: 14 Member

    Just for information: I also tried running the recalibration on separate read groups (for separate samples and chromosomes), but it still did not work.

    In the end I figured out what the problem was: it was the VCF file. I was using a merged VCF file from the KGP project. When I replaced it with the dbsnp_137.b37.vcf file from the GATK bundle, it worked perfectly. So it seems something was wrong with the KGP VCF file; although it worked in CountCovariates, it did not work in BaseRecalibrator.
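
The thread does not say what exactly was wrong with the KGP VCF. As a first look at a known-sites file before passing it to `-knownSites`, a sketch like the one below counts data records and sample columns, which makes it easy to compare two candidate files. The function name `vcf_summary` and the demo file are made up for illustration, and the sketch assumes an uncompressed, tab-delimited VCF.

```shell
#!/usr/bin/env bash
# Sketch: summarize a VCF by number of data records and sample columns.

vcf_summary() {
    awk -F'\t' '
        /^##/     { next }                                   # meta-information lines
        /^#CHROM/ { samples = (NF > 9) ? NF - 9 : 0; next }  # columns after FORMAT are samples
                  { records++ }                              # everything else is a data record
        END       { printf "records=%d samples=%d\n", records, samples }
    ' "$1"
}

# Demo on a tiny made-up VCF with two sample columns.
demo=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\n' > "${demo}"
vcf_summary "${demo}"
rm -f "${demo}"
```

Running it on both the merged file and dbsnp_137.b37.vcf would show how the two differ in size and in the number of genotype columns, without implying that either difference was the cause here.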
