
# Memory error, and ProgressMeter strangeness during HaplotypeCaller

Posts: 28Member
edited April 2014

Log attached.
It's become hard to tell what the columns mean in the ProgressMeter output, but I know this is strange: I see six columns in the output, but the header line apparently has nine space-delimited columns. The second column is all zeros, and the fourth approaches infinity before my memory error ("because you did not provide enough memory").

The crash is probably more pertinent. This morning when I found it had crashed, I saw that I had not specified any -Xmx argument to java, so I changed it to the amount of RAM (32G) divided by the -nct parameter of 8, for about 3700MB each. That's a worst-case scenario of duplicating RAM usage per thread, so now I'm stumped: why and how is 3.7GB not enough for a read-walker?
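For reference, a sketch of the arithmetic behind that number (the headroom allowance is an assumption on my part; note also that -nct worker threads live in a single JVM, so -Xmx bounds the whole process heap rather than each thread individually):

```shell
total_mb=$((32 * 1024))   # 32GB of physical RAM, in MB
nct=8                     # thread count passed via -nct
headroom_mb=3072          # assumed allowance for the OS and JVM native memory
echo $(( (total_mb - headroom_mb) / nct ))   # worst-case budget per thread, in MB
```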

For now I'm going to try -nct 4 and -Xmx7G

edit:
Re-trying with -nct 4 and -Xmx7G seems to show the same behavior; I have zero active regions, and hundreds of weeks in column four. My estimated runtime is fluctuating between 70 hours and 6 days, so I expect the same memory crashout in a few hours. Will update then.

Post edited by KStamm

Ah, the ProgressMeter strangeness is due to the unwieldy column headers, and (oops) a missing field. Decomposed, you read it as:

Location: 1:825393
processed.active regions: 0.00e+00
[elapsed] (missing from the header; we'll fix this): 30.0 s
runtime per.1M.active regions: 49.6 w
completed: 0.0%
total.runtime: 31.3 h
remaining: 31.3 h

The runtime-per-unit field gives especially useless results for active regions, as it was originally meant for individual sites. I'll see if we can get this made more sane as well.

The memory error is a little fuzzier to address. Keep in mind that HC is no mere read-walker; it's an ActiveRegion walker that performs all sorts of fancy calculations, including De Bruijn graph assembly (and it is currently not smart about downsampling). I can only recommend increasing memory or decreasing thread count...

Geraldine Van der Auwera, PhD

• Posts: 28Member

Thanks, that helps a lot. An elapsed-time column is a good explanation for why that column kept growing.

About the memory error: I found a few other threads on the support forums, but they refer to version 2.2 or 2.4 and discuss downsampling or minPruning. I'll check what the read depth looks like at the site of the crash.

The recommendations about minPruning are certainly still applicable for 3.x. Downsampling also in principle, but in practice it does not currently behave very well, unfortunately.

Geraldine Van der Auwera, PhD

• Posts: 28Member

I went to check the coverage depth at the locus reported just before the crash. samtools depth said nothing! Of course, the crash was probably a few hundred bases downstream of the last successfully reported locus. Notably, this is a centromere, and I think it makes sense for a haplotype realigner to choke when the reference is all NNNN, although there should have been no reads present. Maybe it had some unresolved haplotypes upstream of the centromere and was searching an exponential space? It's probably pertinent that this crash site, at 128 megabases, has more than 20 million N's in both directions.

Maybe there's a region-mask function I need to apply to the caller's argument list? If I had specified regions of interest, maybe this problem would not have occurred?

• Posts: 28Member

This is WGS, about 40x coverage. I'm trying to follow the new best practices and produce eleven individual GVCFs before rejoining them.

So the re-run with -nct4 and -Xmx7G also failed after almost two hours, even though progress was going fine; you can see we were at 4% completion with 40 hours remaining. This stop point is a little later than the previous one, but still within the centromere. There should be zero reads there too.

INFO  04:56:30,968 ProgressMeter -     1:128886586        0.00e+00  100.8 m     9998.8 w      4.2%        40.4 h    38.7 h
INFO  04:57:36,674 ProgressMeter -     1:136829703        0.00e+00  101.9 m     10107.5 w      4.4%        38.5 h    36.8 h
INFO  05:01:29,849 ProgressMeter -     1:136876486        0.00e+00  103.3 m     10248.0 w      4.4%        39.0 h    37.3 h

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "ProgressMeterDaemon"
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version nightly-2014-03-18-g8d1a043):
##### ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java
##### ERROR ------------------------------------------------------------------------------------------


The partial output GVCF file looks fine (although, this being my first GVCF, I wouldn't know what a bad one looks like!)

1       121398195       .       AC      A,<NON_REF>     105.73  .       BaseQRankSum=0.509;ClippingRankSum=1.054;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=21.00;MQ0=0;MQRankSum=-0.936;ReadPosRankSum=1.552   GT:AD:DP:GQ:PL:SB       0/1:44,9,0:53:99:143,0,1948,284,1976,2260:12,32,0,9
1       121398197       .       C       <NON_REF>       .       .       END=121398197   GT:DP:GQ:MIN_DP:PL      0/0:54:0:54:0,0,1147


What we can see is that this final variant position is some 25 megabases from the final ProgressMeter position. This bolsters my centromere memory-leak hypothesis.

Next I'm going to attempt upgrading to a stable version rather than this March 18th nightly, then perhaps start putting together a BED file of non-junk DNA regions. I don't see any BED files as part of the GATK bundle, so this will be a manual construction.

Yep, your GVCF looks fine indeed. We mostly process exomes, so if this is centromere-related we would probably not naturally come across it. So please do keep us posted on what results you get. If this is a bug, we want to know so we can fix it. But hopefully the stable version will perform better. Good luck!

Geraldine Van der Auwera, PhD

• Posts: 28Member
edited April 2014

Well, updating to 3.1-1 didn't change anything. The Exception dump is a few lines shorter and cleaner now, which might be helpful:

INFO  17:56:13,120 ProgressMeter -     1:121480508        0.00e+00   97.5 m     9672.9 w      3.9%        41.5 h    39.9 h
INFO  17:57:13,122 ProgressMeter -     1:121483974        0.00e+00   98.5 m     9772.1 w      3.9%        41.9 h    40.3 h
INFO  17:58:14,095 ProgressMeter -     1:129193186        0.00e+00   99.5 m     9872.9 w      4.2%        39.8 h    38.2 h
INFO  17:59:17,333 ProgressMeter -     1:136646788        0.00e+00  100.6 m     9977.4 w      4.4%        38.0 h    36.4 h
INFO  18:00:25,092 ProgressMeter -     1:136827585        0.00e+00  101.7 m     10089.5 w      4.4%        38.4 h    36.7 h
INFO  18:01:28,012 ProgressMeter -     1:136827886        0.00e+00  102.7 m     10189.3 w      4.4%        38.8 h    37.1 h
at java.util.Formatter.parse(Formatter.java:2525)
at java.util.Formatter.format(Formatter.java:2469)
at java.util.Formatter.format(Formatter.java:2423)
at java.lang.String.format(String.java:2797)


GC overhead limit doesn't sound like the kind of thing I can fix by giving more than 7GB. It really feels like a systemic bug to have the crash and memory error come out of ProgressMeter.printProgress's String.format command. This isn't even HaplotypeCaller's crash. And you'll notice that while the elapsed time progressed smoothly, the genomic coordinates jump from 121MB to 129 and then 136MB. It does appear to be speeding along the centromere efficiently, but I would guess the ProgressMeterDaemon is queueing up the whole 25-million-N sequence into some kind of internal error message. Really telling that the system runs out of RAM during a String.format within a printProgress().

Since it's Friday night, I'll give this one more try at -nct1 and -Xmx30G. Note this first sample is a 164GB BAM with 1170M reads, 99% mapped.


Well darn. If you can produce a data snippet that reproduces this we'll take a look in the debugger...

Geraldine Van der Auwera, PhD

• Posts: 28Member

I cut out the region from 121.39 megabases through 140 megabases, and it's shockingly ~5 million reads. The test case is 500 megabytes, so I wanted to run it, see the crash, and snip in closer to the error site before uploading. But it passed the whole chromosome successfully, using 12GB of RAM. So -Xmx7G really was the problem, but I still think something is funny about needing that kind of resources for a 500-megabyte BAM. Might have to do with the GVCF/reference model. Maybe it will double down and choke on chromosome 2, but it will take the -nct1 run some hours before it reaches chr2.

Of course it passed the test at -nct1 -Xmx30G, so it would be a good idea to look at -nct4 -Xmx8G.

INFO  19:00:30,051 ProgressMeter -     1:135577798        0.00e+00    5.9 m      590.1 w      4.4%         2.3 h     2.2 h
INFO  19:01:02,995 ProgressMeter -     1:137296786        0.00e+00    6.5 m      644.6 w      4.4%         2.4 h     2.3 h
INFO  19:01:35,461 ProgressMeter -     1:137385235        0.00e+00    7.0 m      698.3 w      4.4%         2.6 h     2.5 h


The error message is a little different. Curiously, when I monitor memory usage with 'top', it shows this process using 8.4GB and thirteen hundred percent CPU, as if it were -nct 13. That's definitely not supposed to happen; I would expect -nct4 to use no more than 500% CPU. Perhaps there is some kind of internal forking. Also, that error printed incredibly slowly; I've never seen a Java stack trace take seconds between each line. That might be normal behavior for a java.lang.OutOfMemoryError. While the error is still within ProgressMeter, this time it came from takeProgressSnapshot, and it might be mad about the 700-week estimate.

I won't be able to upload an example until I've made sure it is clean of patient-specific information, so that too will take a while.

• Posts: 28Member

After three days, the single-threaded run is reporting about six days left for the first sample. I can't wait a week per sample, so I'm going to try -nct4 again, this time with a contig-gaps mask to omit the centromere regions. Using an annotation of unmappable regions from the UCSC Table Browser Gaps track, I have a BED of regions not worth trying HaplotypeCaller over. I'll try running "-nct4 -Xmx12G" now, and expect it to run smoothly over these non-pathological genomic ranges.
As there is less than 4x12GB of physical RAM on this server, we'll soon learn whether HC requires the full amount, and what happens when it can't get it!
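A sketch of how such a "good regions" BED can be derived by complementing the gaps track (file names and demo data here are hypothetical; gaps.bed holds sorted UCSC gap intervals and genome.txt lists "chrom&lt;TAB&gt;length" as in the chromInfo table):

```shell
# Demo inputs (hypothetical): chromosome sizes and sorted gap intervals.
printf 'chr1\t100\nchr2\t50\n' > genome.txt
printf 'chr1\t10\t20\nchr1\t30\t40\n' > gaps.bed

sort -k1,1 -k2,2n gaps.bed |
awk 'BEGIN { OFS = "\t" }
     NR == FNR { len[$1] = $2; order[++n] = $1; next }  # first file: chromosome sizes
     {
       seen[$1] = 1
       if ($1 != chrom) {                               # new chromosome: close out the previous one
         if (chrom != "" && pos < len[chrom]) print chrom, pos, len[chrom]
         chrom = $1; pos = 0
       }
       if ($2 > pos) print chrom, pos, $2               # gap-free interval before this gap
       if ($3 > pos) pos = $3
     }
     END {
       if (chrom != "" && pos < len[chrom]) print chrom, pos, len[chrom]
       for (i = 1; i <= n; i++)                         # chromosomes with no gaps at all
         if (!seen[order[i]]) print order[i], 0, len[order[i]]
     }' genome.txt - > good_regions.bed

cat good_regions.bed
```

With the demo data, this emits chr1 split around its two gaps plus the gap-free chr2 whole. bedtools complement does the same job if it's installed.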

OK, please let us know how it goes.

Geraldine Van der Auwera, PhD

• Posts: 28Member

Things went much further, and the error message is no longer memory-related, so I think we did avoid the error that started this forum thread. The new error:

INFO  12:14:14,879 ProgressMeter -     8:134757330        1.04e+10   22.2 h        7.0 s     51.6%        43.0 h    20.8 h
INFO  12:15:14,880 ProgressMeter -     8:136728598        1.04e+10   22.2 h        7.0 s     51.7%        43.0 h    20.8 h
INFO  12:16:14,128 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
at ...PairHMMLikelihoodCalculationEngine.java:443)
at ...PairHMMLikelihoodCalculationEngine.java:417)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Might be time to start a new thread, since this is no longer a memory and ProgressMeter error. I guess we solved that one by using a regions BED to avoid the centromeres. This new error has stopped at about 20-24 hours into each of four samples, at varying genomic locations. Does this mean HaplotypeCaller is incompatible with WGS, and I should fall back to UG?

• Posts: 7,369Administrator, GATK Developer admin

No, HaplotypeCaller is fine with WGS -- as far as HC is concerned, there is no difference. This looks more and more like a multithreading issue to me. We've seen some potential thread-safety issues recently with HC; it's hard to pin down because in our hands it doesn't happen, so it may be platform-related. We're going to look into this, but for now I would recommend that you run single-threaded but parallelize by chromosome. You can run HC per chromosome, then concatenate the resulting VCFs with the CatVariants utility.

Geraldine Van der Auwera, PhD

• Posts: 28Member
edited April 2014

Thanks for your support, I'll work on chopping this by chromosome.
Platform details:

Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (2 CPUs x 4 cores with HT ~= 16 threads)
java version "1.7.0_45"
OpenJDK Runtime Environment (rhel-2.4.3.3.el6-x86_64 u45-b15)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
GATK version 3.1-1-g07a4bf8

I have a NullPointer at (PairHMMLikelihoodCalculationEngine.java:443), and that region, according to the GitHub for gatk-protected:

for( final Map.Entry<GATKSAMRecord, Map<Allele,Double>> entry : stratifiedReadMap.get(sample).getLikelihoodReadMap().entrySet() ) {
    // Compute log10(10^x1/2 + 10^x2/2) = log10(10^x1+10^x2)-log10(2)
    // First term is approximated by Jacobian log with table lookup.
    haplotypeLikelihood += ( MathUtils.approximateLog10SumLog10(entry.getValue().get(iii_allele), entry.getValue().get(jjj_allele)) + MathUtils.LOG_ONE_HALF );
}

Seems like the error comes from entry.getValue().get(allele), indicating that either entry is null or getValue returned null.

• Posts: 28Member

Indeed it is related to multithreading; my eleventh sample just finished with this message:

INFO 15:16:46,059 ProgressMeter - 1:121134037 4.34e+08 67.2 m 9.0 s 4.2% 26.7 h 25.5 h
INFO 15:17:46,060 ProgressMeter - 1:121355153 4.34e+08 68.2 m 9.0 s 4.2% 27.0 h 25.9 h
INFO 15:17:51,087 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.util.ConcurrentModificationException
at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:394)
at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:413)
at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:412)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Line 257 of GenotypingEngine.java:

for (Map.Entry<GATKSAMRecord, Map<Allele, Double>> perRead : perSample.getValue().getLikelihoodReadMap().entrySet()) {

Looks like threads are sharing this Map object.

• Posts: 28Member
edited April 2014

@Geraldine_VdAuwera said:
We're going to look into this but for now I would recommend that you run single-threaded, but parallelize by chromosome. You can run HC per chromosome then concatenate the resulting VCFs with the CatVariants utility.

So running HC region by region works great. I used GNU Parallel to cut lines out of my non-centromere/non-gaps BED file and run them as separate java processes. The result is nearly three hundred GVCF files per sample. You recommended CatVariants to recombine these files:

java -cp GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants \
    -R ref.fasta \
    -V input1.vcf \
    -V input2.vcf \
    -out output.vcf \
    -assumeSorted

Having three hundred -V lines is unwieldy but not impossible with some BASH script. A real problem arises: CatVariants refuses to work on GVCF, and reports an error that the files must be BCF or VCF.
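For the record, a bash array handles a three-hundred-element -V list without much pain; a minimal sketch with hypothetical file names (the commented java line is the eventual invocation, not run here):

```shell
# Demo setup with hypothetical names: per-region GVCFs, using the .g.vcf
# extension so the files are still recognized as VCFs.
touch sample.region_0001.g.vcf sample.region_0002.g.vcf

args=()
for f in sample.region_*.g.vcf; do   # glob expands in lexical order, so zero-pad region numbers
  args+=(-V "$f")
done
printf '%s\n' "${args[@]}"           # inspect the generated argument list

# The eventual invocation (not run here):
# java -cp GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants \
#   -R ref.fasta "${args[@]}" -out sample.merged.g.vcf -assumeSorted
```

Zero-padding the region numbers matters because -assumeSorted trusts the input order.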
Maybe it's just the filenames, and I could change them all; but since GVCF is quite different from regular VCF, I need to ask: is CatVariants the right tool for this job?

• Posts: 22GATK Developer mod

It's just a filename issue. Like other GATK tools, the extension .vcf is necessary for GVCFs. Some users prefer to use file names ending in .g.vcf to make this more explicit. Processing .list files is on the TODO list for CatVariants.

Joel Thibault ~ Software Engineer ~ GSA ~ Broad Institute

• Posts: 2Member

Has there been any progress on this thread? I've just been hitting the same type of "out of memory" error running SplitNCigarReads on some RNAseq data. I've been following the recipe for variant calling from RNAseq, all single-threaded and a pretty vanilla command line:

java -Xmx32G -Djava.io.tmpdir=./tmp -jar ~/bin/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T SplitNCigarReads -R hg19_karyoSorted.fa -I dedupped_chr12_rg_added.bam -o split_chr12.bam -rf ReassignMappingQuality -DMQ 60 -U ALLOW_N_CIGAR_READS

I tried upping the Java heap space to 32G, but that didn't help. I have noticed that I'm running into this problem somewhere on chr12 in several different independent samples. When I extracted just chr12 to a separate BAM file, I noticed the process thrashing around the same spot - chr12:66,451,461...

INFO 15:37:16,793 ProgressMeter - chr12:66451461 1.25e+07 14.8 m 71.0 s 65.2% 22.7 m 7.9 m

I'm running Java build 1.7.0-b147, GATK 3.1-1 on CentOS 6.5.
Here is the original error I hit:

INFO 04:45:06,582 ProgressMeter - chr12:66451461 1.58e+08 3.7 h 84.0 s 65.2% 5.7 h 119.6 m
INFO 04:57:28,575 ProgressMeter - chr12:66451461 1.58e+08 3.8 h 86.0 s 65.2% 5.8 h 2.0 h
Exception in thread "ProgressMeterDaemon"
##### ERROR ------------------------------------------------------------------------------------------
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Unknown Source)
at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source)
at java.lang.AbstractStringBuilder.append(Unknown Source)
at java.lang.StringBuilder.append(Unknown Source)
at java.lang.StackTraceElement.toString(Unknown Source)
at java.lang.String.valueOf(Unknown Source)
at java.lang.StringBuilder.append(Unknown Source)
at java.lang.Throwable.printStackTrace(Throwable.java:657)
at java.lang.Throwable.printStackTrace(Throwable.java:720)
at org.apache.log4j.spi.LocationInfo.<init>(LocationInfo.java:113)
at org.apache.log4j.spi.LoggingEvent.getLocationInformation(LoggingEvent.java:247)
at org.apache.log4j.helpers.PatternParser$ClassNamePatternConverter.getFullyQualifiedName(PatternParser.java:538)
at org.apache.log4j.helpers.PatternParser\$NamedPatternConverter.convert(PatternParser.java:511)
at org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
at org.apache.log4j.PatternLayout.format(PatternLayout.java:502)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.info(Category.java:666)

##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):

@kbrown We haven't had a test case to debug on. If you can generate a test snippet that reproduces the error we're happy to take a look at it. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

Geraldine Van der Auwera, PhD

• Posts: 28Member

My original error was not from "-T SplitNCigarReads" but from "-T HaplotypeCaller"; still, running out of memory during a ProgressMeter update could be related. I solved the problem with a workaround suggested a few times on these forums: don't use the tool's multithreading, but parallelize over regions instead, while avoiding pathological regions. My problem was processing WGS without specifying genomic coordinates, and the tool choking at a region of difficult sequence (near a centromere, the number of possible haplotypes supporting indels is exponential).

So I got a BED file of centromeres and gaps from the UCSC Table Browser, took the inverse to get a BED of 'good regions' with some confidence I'm not missing any relevant genes, then used GNU Parallel to run an instance of HaplotypeCaller over each line of the good-regions BED. This resulted in 275 VCF files per sample, which are sorted and merged into a master before pooling samples with GVCF joint calling. While it's possible for an individual segment to crash, none of the log files grep for Error.
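The per-region fan-out looks roughly like this (a sketch: the file names and demo data are hypothetical, and the commented HC flags are from GATK 3.x memory, so check them against your version):

```shell
# Demo input (hypothetical): one "good" region per line; BED is 0-based, half-open.
printf 'chr1\t0\t121000000\nchr1\t142000000\t249250621\n' > good_regions.bed

# Convert each BED line to a GATK -L region string (1-based, inclusive):
awk '{ printf "%s:%d-%d\n", $1, $2 + 1, $3 }' good_regions.bed > regions.list
cat regions.list

# Each region then gets its own single-threaded HC process, e.g. via GNU Parallel
# (not run here; {} is the region, {#} the job number):
# parallel -j 4 'java -Xmx7G -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
#   -R ref.fasta -I sample.bam -L {} --emitRefConfidence GVCF \
#   -o sample.{#}.g.vcf' < regions.list
```

The off-by-one conversion matters: BED starts are 0-based, while GATK -L intervals are 1-based and inclusive.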

I see that chr12 is broken into seven pieces by this method, and your problem area is included in the 12:37-109 (MB) 'good' region. When I look at a TopHat-generated BAM, there's one read at your bad site, but when I look at a BWA (aln/sampe) BAM, there's a pile of 40 reads with mapq zero, and holy cow, the reference genome is almost one hundred consecutive thymines. That would break a lot of algorithms. Since there's no gene between 66.38 and 66.50 MB, maybe you can exclude the region without impacting your RNA-Seq.

• Posts: 2Member

Thanks @KStamm. I had read your posts (and impressive debugging!) thoroughly and was trying to apply them to my issue. I realize you were finding the error in HaplotypeCaller, but one of your posts mentioned the ProgressMeter throwing the error, which is why I figured our problems could be related.

After some careful inspection of my BAM files, I found the problem exactly where you mentioned: the stretch of ~100 T's. I think the problem is a bad library prep (VERY high 3' bias) and literally MILLIONS of reads aligning precisely on that stretch of T's. I suspect this is an unlikely and atypical problem, so it may not be worth the effort to try to trap it. However, I'd be happy to submit the BAM file for that 3MB region if there is interest in handling the situation in GATK.

• Kevin...