java.lang.NullPointerException with HaplotypeCaller in GVCF mode

I got the following error when running a single sample (out of multiple) on one of the cluster nodes. On the head node the error did not occur, but I don't know why.
Does this error sound familiar? If so, please link me to the fix or discussion. If not, please do not spend too much time on it.
This is the error log:
INFO 21:49:26,894 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:49:26,898 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 21:49:26,898 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 21:49:26,903 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 21:49:26,909 HelpFormatter - Program Args: -T HaplotypeCaller -R human_g1k_v37.fasta --dbsnp dbsnp_138.b37.vcf -I RCC-ER.bam -stand_call_conf 10.0 -stand_emit_conf 30.0 -o RCC-ER.g.vcf -nct 8 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
INFO 21:49:26,915 HelpFormatter - Executing as [email protected] on Linux 3.0.101-0.7.17-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15.
INFO 21:49:26,915 HelpFormatter - Date/Time: 2015/03/03 21:49:26
INFO 21:49:26,915 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:49:26,916 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:49:27,200 GenomeAnalysisEngine - Strictness is SILENT
INFO 21:49:27,354 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 21:49:27,366 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 21:49:27,436 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
INFO 21:49:27,477 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 21:49:27,810 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 48 processors available on this machine
INFO 21:49:27,901 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 21:49:28,234 GenomeAnalysisEngine - Done preparing for traversal
INFO 21:49:28,235 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 21:49:28,235 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 21:49:28,236 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 21:49:28,237 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO 21:49:28,237 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output
INFO 21:49:28,445 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 21:49:28,447 PairHMM - Performance profiling for PairHMM is disabled because HaplotypeCaller is being run with multiple threads (-nct>1) option
Profiling is enabled only when running in single thread mode
INFO 21:49:51,883 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 21:49:51,884 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
INFO 21:49:55,536 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
at java.lang.String.checkBounds(String.java:374)
at java.lang.String.<init>(String.java:314)
at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
at htsjdk.samtools.BAMRecord.decodeReadName(BAMRecord.java:331)
at htsjdk.samtools.BAMRecord.getReadName(BAMRecord.java:220)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingGraph.addRead(ReadThreadingGraph.java:585)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.createGraph(ReadThreadingAssembler.java:178)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.assemble(ReadThreadingAssembler.java:117)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.LocalAssemblyEngine.runLocalAssembly(LocalAssemblyEngine.java:169)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.assembleReads(HaplotypeCaller.java:1163)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1000)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------
Best Answers
-
Kurt ✭✭✭
I think the ThreadPoolExecutor is something that happens when you add -nct to the HaplotypeCaller. It usually is a transient, unreproducible error. I just take -nct out when running HaplotypeCaller.
-
Geraldine_VdAuwera Cambridge, MA admin
The transient nature of these bugs does indicate a link to the multithreading; the bug may be caused by the multithreading itself (could be we're losing track of an object somewhere) or the bug is caused by a bad read that is not always used since downsampling is non-deterministic when multithreading is used.
We have also moved away from using nct with HaplotypeCaller; we find that parallelizing by scatter-gathering jobs is more effective. These can be done in combination of course.
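For anyone wondering what scatter-gathering could look like here, the following is only a rough sketch, not the Broad's actual pipeline. It builds one single-threaded HaplotypeCaller command per contig using -L, reusing the arguments from the original command line. The jar path, the output naming scheme and the contig list (b37 autosomes plus X, Y, MT, without the unplaced GL contigs) are assumptions for illustration; the gather step is left out.

```python
from typing import List

# Assumed contig list for the b37 reference named in the thread.
B37_CONTIGS = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]


def scatter_commands(bam: str, reference: str, sample: str,
                     contigs: List[str] = B37_CONTIGS,
                     gatk_jar: str = "GenomeAnalysisTK.jar") -> List[List[str]]:
    """Build one HaplotypeCaller command per contig, without -nct."""
    commands = []
    for contig in contigs:
        commands.append([
            "java", "-jar", gatk_jar,          # jar location is an assumption
            "-T", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-L", contig,                      # restrict this job to one contig
            "--emitRefConfidence", "GVCF",
            "--variant_index_type", "LINEAR",
            "--variant_index_parameter", "128000",
            "-o", f"{sample}.{contig}.g.vcf",  # hypothetical naming scheme
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be submitted as separate cluster jobs.
    for cmd in scatter_commands("RCC-ER.bam", "human_g1k_v37.fasta", "RCC-ER"):
        print(" ".join(cmd))
```

Each printed command can be submitted as its own cluster job; the per-contig GVCFs then need to be concatenated in contig order afterwards.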
Btw when you run HC in GVCF mode, the confidence thresholds (-stand_call_conf 10.0 -stand_emit_conf 30.0) are ignored and set to 0 internally, as noted in the line:
INFO 21:49:28,237 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
The thresholds are then applied in the GenotypeGVCFs step.
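As a hedged illustration of that last point, the sketch below builds a GenotypeGVCFs command that carries the confidence thresholds instead of the HaplotypeCaller call. The jar path, file names and the helper function itself are made up for the example; the threshold values are simply the ones from the original command line, not a recommendation.

```python
from typing import List


def genotype_gvcfs_command(gvcfs: List[str], reference: str, output: str,
                           call_conf: float, emit_conf: float,
                           gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build a GATK 3 GenotypeGVCFs command with the thresholds applied here."""
    cmd = ["java", "-jar", gatk_jar,           # jar location is an assumption
           "-T", "GenotypeGVCFs",
           "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]             # one --variant per input GVCF
    cmd += ["-stand_call_conf", str(call_conf),
            "-stand_emit_conf", str(emit_conf),
            "-o", output]
    return cmd


if __name__ == "__main__":
    # Hypothetical second sample and output name, for illustration only.
    print(" ".join(genotype_gvcfs_command(
        ["RCC-ER.g.vcf", "other-sample.g.vcf"],
        "human_g1k_v37.fasta", "cohort.vcf",
        call_conf=10.0, emit_conf=30.0)))
```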
Answers
Hmm, this looks like the program choked on one particular read, but no idea why. I don't remember seeing this before, sorry.
Thanks for answering,
@Geraldine_VdAuwera:
After closer inspection I see it with multiple samples... Can you suggest a way of debugging?
# If I get a good answer, I will flag the answer above as sufficient :P
More details:
I'm using bwa version 0.7.10 and picard version 1.102, btw.
In short, this is the workflow:
bwamem>addorreplacegroups>mergebamfiles>markduplicates>indelrealignment>haplotypecallergvcf
This seems to work:
bwamem>addorreplacegroups>mergebamfiles>markduplicates>indelrealignment>BQSR>haplotypecaller
The regular HaplotypeCaller runs fine on all samples after BQSR PrintReads (at the moment...)
I second what @Kurt said. We run it with "-nct 1"; we see relatively frequent crashes otherwise. The process sometimes makes it to completion, so as an alternative you could keep submitting the parallelized job until it succeeds.
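The resubmit-until-it-succeeds approach could be sketched roughly as below: try the multithreaded command a few times, then fall back to "-nct 1". This is only an illustration under assumptions; the function names, the number of attempts and the injectable runner are made up here, and every attempt starts the job from scratch.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def with_nct(base_cmd: Sequence[str], nct: int) -> List[str]:
    """Append the -nct argument to a HaplotypeCaller command."""
    return list(base_cmd) + ["-nct", str(nct)]


def run_with_retries(base_cmd: Sequence[str], nct: int = 8,
                     max_attempts: int = 3,
                     runner: Optional[Callable[[List[str]], int]] = None) -> int:
    """Return the -nct value that succeeded; raise if every attempt fails."""
    if runner is None:
        runner = subprocess.call               # returns the process exit code
    for _ in range(max_attempts):
        if runner(with_nct(base_cmd, nct)) == 0:
            return nct
    # Last resort: single-threaded run, as suggested in the thread.
    if nct != 1 and runner(with_nct(base_cmd, 1)) == 0:
        return 1
    raise RuntimeError("HaplotypeCaller failed on every attempt")
```

Passing a custom runner makes it possible to submit through a cluster scheduler instead of running the command locally.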
I mailed our internal support. The cause could be missing filesystem mounts, because the out-of-memory killer killed the GPFS daemon.
Cluster complexity at its best ;_;
Sorry for the complaint => maybe GATK could check for this and report it as an error.
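Until something like that exists in the tool, a pre-flight check on the node could catch a dropped mount before the job starts. The sketch below is a minimal version under assumptions: the helper name and the file list are hypothetical, and it only tests that each input exists, is readable and is not empty. It would not detect a mount that disappears while the job is running.

```python
import os
import sys
from typing import List


def missing_inputs(paths: List[str]) -> List[str]:
    """Return a human-readable problem for every unusable input file."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append(f"{path}: not found (is the filesystem mounted?)")
        elif not os.access(path, os.R_OK):
            problems.append(f"{path}: not readable")
        elif os.path.getsize(path) == 0:
            problems.append(f"{path}: empty")
    return problems


if __name__ == "__main__":
    # Input names taken from the command line in the log above.
    inputs = ["human_g1k_v37.fasta", "dbsnp_138.b37.vcf", "RCC-ER.bam"]
    problems = missing_inputs(inputs)
    for problem in problems:
        print(problem, file=sys.stderr)
    sys.exit(1 if problems else 0)
```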
Oh interesting -- thanks for letting us know!