HaplotypeCaller java.lang.NullPointerException Error

andyktandykt Member
edited June 2014 in Ask the GATK team

Hi,
I'm still new to bioinformatics so I'll apologise in advance if I've made an obvious error anywhere here... I am running HaplotypeCaller on four species of rodent (Mus musculus and three wild, non-model species), with 12 individual bam files in each.

It works fine for Mus, but for one of the non-model species it ran for 18 days, then crashed at 90%; only then did I realise I could speed things up by passing it a file of only the exonic intervals I'm interested in. However, when I try running it with the interval file I get the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:295)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:171)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
at java.util.Arrays.sort(Arrays.java:472)
at java.util.Collections.sort(Collections.java:155)
at org.broadinstitute.sting.utils.interval.IntervalUtils.sortAndMergeIntervals(IntervalUtils.java:254)
at org.broadinstitute.sting.utils.interval.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:805)
at org.broadinstitute.sting.utils.interval.IntervalUtils.loadIntervals(IntervalUtils.java:612)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindingsPair(IntervalUtils.java:587)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:549)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:721)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:287)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

The code I used is:

java -Xmx10g -jar ~/bin/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-nct 10 \
-R ~/REF.fasta \
--intervals INTERVALS.bed \
--interval_padding 20 \
-log LOGFILE.log \
-I 1.bam \
-I 2.bam \
-I 3.bam \
-I 4.bam \
-I 5.bam \
-I 6.bam \
-I 7.bam \
-I 8.bam \
-I 9.bam \
-I 10.bam \
-I 11.bam \
-I 12.bam \
-o OUT.vcf

It works fine without the interval file, but I don't want to risk it taking three weeks and crashing again; so I think I need to include the intervals to speed things up. The BED file I use for the interval file is simply formatted as :

scaffold2617 357126 357876
scaffold2617 357876 357936
scaffold2617 357879 357936
etc.

Finally, as mentioned the code above works for Mus musculus, so there is obviously something different happening with my non-model species, but the BED file for each is formatted in exactly the same way....

Any help would be appreciated.


Answers

  • andyktandykt Member
    edited June 2014

    Sorry about the formatting and everything being on one line for the code and BED file; it looked fine on my screen... Happy to have this deleted and to ask again if I can find out what went wrong....

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    Don't worry about the formatting. To improve it you can use Markdown, but this is readable as is.

    I haven't seen this error before. I assume this occurs early into the program run? Can you tell me if this also occurs if you remove the -nct argument, and if you remove the interval_padding argument?

  • andyktandykt Member

    Hi Geraldine, thanks for your quick reply. It does occur very early in the run, yes, the last few lines of the log file are below:


    INFO 09:01:35,517 GenomeAnalysisEngine - Strictness is SILENT

    INFO 09:01:58,083 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250

    INFO 09:01:58,090 SAMDataSource$SAMReaders - Initializing SAMRecords in serial

    INFO 09:03:52,915 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 114.82

    INFO 09:03:54,637 HCMappingQualityFilter - Filtering out reads with MAPQ < 20

    INFO 09:04:02,011 GATKRunReport - Uploaded run statistics report to AWS S3


    The error still occurs when I remove the -nct argument. When I remove --interval_padding the program runs a little further and then I get a different error:


    ERROR MESSAGE: Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The stop position 1340322 is less than start 1899306 in contig scaffold11

    I'm happy enough to lose the --interval_padding argument if necessary, but I've had a look at my BED intervals file, and there is no stop position 1340322 for scaffold11...

    Any further guidance would be much appreciated!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Just to be clear, your non-model bams were all aligned to the same Mus reference, right?

  • andyktandykt Member

    Ah, no they weren't; they've been mapped to their own de novo draft genome sequence.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That explains it then. If the references are different, the contigs are going to be different too, so you can't use the same intervals file. Also, you won't be able to compare them to each other. If you want to do so at any point, you should map them to the same reference file.

  • andyktandykt Member

    Hi Geraldine.

    I'm not using the same intervals file for each species; the Mus interval file is specific to the Mus reference, and the others are specific to their own draft genome. In short, I have sequenced about 1000 genes in each species using sequence capture. Mus was easy - I just told Agilent which genes to capture; for the others, I provided coordinates for them to design probes from the relevant draft genome, and I am using the coordinates of these captured regions as my interval files for each of my individual species.

    Thanks for your continued help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh I see, thanks for clarifying what your setup is. Alright, then it sounds like you're doing the right thing, but you need to check systematically that your intervals files are all "sane" and match the corresponding sequence dictionary.
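
One way to run the "sanity" check suggested above is a small script that compares each BED line against the contig lengths of the matching reference. The following is a minimal sketch, not part of GATK: the function names are hypothetical, and it assumes the contig lengths come from a samtools-style `.fai` index (contig name and length in the first two tab-separated columns). It flags intervals whose start is not below their end, which is the kind of problem reported in the "Badly formed genome loc" message, as well as unknown contigs and intervals that run past the end of a contig.

```python
from typing import Dict, Iterable, List


def load_contig_lengths(fai_lines: Iterable[str]) -> Dict[str, int]:
    """Parse .fai index lines (name<TAB>length<TAB>...) into {contig: length}."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def check_bed(bed_lines: Iterable[str], lengths: Dict[str, int]) -> List[str]:
    """Return a list of human-readable problems found in the BED lines."""
    problems = []
    for number, line in enumerate(bed_lines, start=1):
        # Skip blank lines and the usual BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            problems.append(f"line {number}: fewer than 3 columns")
            continue
        contig, start_text, end_text = fields[:3]
        try:
            start, end = int(start_text), int(end_text)
        except ValueError:
            problems.append(f"line {number}: non-integer coordinates")
            continue
        if contig not in lengths:
            problems.append(f"line {number}: contig {contig} not in reference")
            continue
        # BED is 0-based and half-open, so a valid interval has 0 <= start < end.
        if start < 0 or start >= end:
            problems.append(f"line {number}: start {start} is not below end {end}")
        if end > lengths[contig]:
            problems.append(
                f"line {number}: end {end} exceeds contig length {lengths[contig]}"
            )
    return problems


# Illustrative data only: the scaffold length here is made up.
lengths = load_contig_lengths(["scaffold11\t2000000\t12\t60\t61\n"])
for problem in check_bed(["scaffold11 1899306 1340322", "scaffold11 10 20"], lengths):
    print(problem)
```

For real data, both functions accept open file handles, for example the reference's `.fai` file and the intervals BED file for the same species.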

  • ProlixProlix GermanyMember

    Hi,

    I have the exact same error message with a simpler setup (one individual but call all sites)
    I use the command:
    java -Xmx299g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Btau.fasta -I indiv.bam \
    --genotyping_mode DISCOVERY -stand_emit_conf 0 -stand_call_conf 0 -o indiv_VC.vcf \
    --emitRefConfidence BP_RESOLUTION --variant_index_type LINEAR \
    --variant_index_parameter 128000 -nct 16

    I read in a different post that it could be a problem with an update of Java.
    So if you have Java 1.8, you could try downgrading to 1.7.
    (Note: To check what your version is, you can use the command "java -version").

    It did not help me since I already had Java 7, but it seems it has solved the problem for the user from the related post. So maybe it can help you!
    Any other idea is welcome....

  • andyktandykt Member

    Thank you, I'll certainly have a go!

  • ProlixProlix GermanyMember

    I realize now that my previous post was not very clear. I was, of course, talking about the first error, the java.lang.NullPointerException error.

    Update
    In my case, removing the -nct argument removes the java.lang.NullPointerException error message. I have no message on the std error.
    But now the std output is:
    "Error occurred during initialization of VM
    Could not reserve enough space for object heap"

    • Is there something wrong with my arguments?
    • Should I "split" my bam file in separated intervals to reduce the amount of memory needed?
    • Any other suggestion?
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Prolix‌

    Hi,

    Removing -nct was the right thing to do. For your new error, you can tweak your java settings by adding -Xmx4g to your command (right after java, before -jar). This will increase the heap size.

    Everything else is fine, but please note that the stand_call_conf and stand_emit_conf arguments will be ignored when Haplotype Caller is run with -ERC.

    -Sheila

  • ProlixProlix GermanyMember

    Hi Sheila,

    Thanks for the help.

    so it actually seems that the "Error occurred during initialization of VM Could not reserve enough space for object heap" message just happens before the other error and that's why the java.lang.NullPointerException error disappears when I remove -nct.

    When I fix the memory issue by changing the queuing settings, the NullPointer.Exp... error reappears. With or without -nct.

    Any other idea?

  • pdexheimerpdexheimer Member ✭✭✭✭

    Is it the same NullPointerException? I mean, is the first line of your stack trace identical to andykt's (java.lang.NullPointerException at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:295))?

    It's entirely possible to get the same Exception in different parts of the code, which could mean that they're different underlying errors
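
To make that comparison less error-prone than reading two traces by eye, one option is to reduce each trace to the exception line plus its first `at` frame and compare those. This is a minimal sketch with a hypothetical helper name; it only looks at the topmost frame and ignores any "Caused by" sections further down.

```python
from typing import Optional, Tuple


def trace_signature(trace: str) -> Optional[Tuple[str, str]]:
    """Return (exception line, first 'at' frame) for a Java stack trace, or None."""
    exception = None
    for line in trace.splitlines():
        text = line.strip()
        if not text:
            continue
        if exception is None:
            # The first non-frame line is taken to be the exception itself.
            if not text.startswith("at "):
                exception = text
            continue
        if text.startswith("at "):
            return exception, text[3:].strip()
    return None


first = """java.lang.NullPointerException
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:295)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:171)"""

second = """java.lang.NullPointerException
at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:51)"""

# Same exception class, different top frame: likely different underlying errors.
print(trace_signature(first) == trace_signature(second))
```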

  • ProlixProlix GermanyMember

    Thanks for the remark I would never have noticed otherwise.
    So at the beginning, I'm pretty sure I had the exact same error (same line number...) and now (maybe because of removing the -nct) I actually have a slightly different error:

    java.lang.ExceptionInInitializerError

    • at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:167)
    • at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:57)
    • at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:66)
    • at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:106)

    Caused by: java.lang.NullPointerException

    • at org.reflections.Reflections.scan(Reflections.java:220)
    • at org.reflections.Reflections.scan(Reflections.java:166)
    • at org.reflections.Reflections.(Reflections.java:94)
    • at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:79) ... 4 more

    Note: Right now, I cannot reproduce the other error to make sure that it was the same because I get stuck in the queue.

  • andyktandykt Member

    This new question has confused me a bit and I'm not sure where we're at... just to clarify, was the general consensus answer to my original question 'we don't know'..?

  • pdexheimerpdexheimer Member ✭✭✭✭

    @andykt‌ -

    Personally, I'm suspicious that there's something funny with your interval (and based on Geraldine's last couple of comments, I think she's looking in the same direction). Do all of your interval files crash, or only some of them? Can you edit the files to be more informative (for instance, does it still crash with only the first half of the file? What about the first hundred lines?)

    @Prolix‌ -

    I suspect you're looking at something completely different. If you're getting errors in Reflections, my first thought is that there's either something wrong with your jar file or your java

  • andyktandykt Member

    Thanks, I'll certainly have a play round with the files and see if I can figure it out.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Prolix Are you using java 1.7 or something else? Java 1.8 is not yet supported. Also, are you using the downloaded precompiled jar, or did you compile from source?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @andykt Testing your intervals in chunks like @pdexheimer suggests should be the fastest way to identify the issue. Good luck and let us know what you find.
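
The chunk-testing idea above amounts to a binary search over the lines of the intervals file. The sketch below shows the search on its own, with a hypothetical function name. It assumes the failure is triggered by a single line, so that any chunk containing that line also fails. The `fails` callback is a stand-in: in practice it would write the chunk to a temporary BED file and run the real command on it, which is not shown here.

```python
from typing import Callable, List, Sequence


def find_failing_line(lines: Sequence[str], fails: Callable[[List[str]], bool]) -> int:
    """Return the 0-based index of one line that makes `fails` return True.

    Assumes the whole input fails and that a single line is responsible,
    so any chunk containing that line fails too.
    """
    if not fails(list(lines)):
        raise ValueError("the full input does not fail")
    low, high = 0, len(lines)  # the culprit lies somewhere in lines[low:high]
    while high - low > 1:
        middle = (low + high) // 2
        if fails(list(lines[low:middle])):
            high = middle  # culprit is in the first half
        else:
            low = middle   # otherwise it must be in the second half
    return low


def start_after_end(chunk: List[str]) -> bool:
    """Stand-in failure test: flags any interval whose start exceeds its end."""
    return any(int(line.split()[1]) > int(line.split()[2]) for line in chunk)


intervals = ["s 1 5", "s 5 9", "s 20 10", "s 30 40", "s 50 60"]
print(find_failing_line(intervals, start_after_end))
```

Each round halves the search range, so even a file with tens of thousands of intervals needs only a dozen or so test runs. If several lines are bad, the search still returns one of them, and it can be repeated after fixing that line.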

  • ProlixProlix GermanyMember

    I am working on a university cluster so I don't really know how java and the jar file were installed.
    But a lot of people have been using the same jar file without problems...
    And on the entry node, the version of java is 1.7. So I would imagine that all nodes have the same 1.7 version of Java.

  • andyktandykt Member

    It appears to be working, I think I figured out what I did wrong and, as expected, it was something stupid...

    I think I made my intervals file from my own list of target regions, rather than those covered by the Agilent probes; as not all my regions were able to be covered by their probes, I think I was feeding the program some regions that weren't sequenced. I remade my interval files from the covered regions only (subtracting 1 from the start site as it's a BED file - I think that's correct?) and it seems to work. So far!

    As for the interval padding, I think that because some of my intervals start or finish at the very first or last base of a scaffold, it simply wasn't possible to include flanking sequence.

    Do these both sound logical? I hope so, as it's running now and I'm hoping it will continue!
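
On the BED question: BED coordinates use a 0-based start and an exclusive end, so a region given as 1-based, inclusive positions keeps its end value and has 1 subtracted from its start. A minimal sketch of that conversion follows; the function name is hypothetical, and it assumes the input coordinates really are 1-based and inclusive.

```python
def to_bed_line(contig: str, first_base: int, last_base: int) -> str:
    """Convert a 1-based, inclusive region to a tab-separated BED line."""
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    # Start shifts down by one; the exclusive end equals the inclusive last base.
    return f"{contig}\t{first_base - 1}\t{last_base}"


print(to_bed_line("scaffold2617", 357127, 357876))
```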

  • ProlixProlix GermanyMember
    edited June 2014

    @andykt : Great! Both sound logical to me. Good luck!

    On my side, it seems that the computing nodes do not have the same Java version as the entry node. I will try "forcing" version 1.7 of Java on the computing nodes; that should solve it...
    Thanks a lot.

    For people in the same situation as me, here are a few lines of code you can add to a shell script to check the version of Java on the computing node:
    version=$(java -version 2>&1 | awk -F '"' '/version/ {print $2}')
    echo version "$version"

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @andykt That sounds good. Note that uncovered intervals aren't a problem as such (the program will just observe absence of coverage). However, if you had this:

    some of my intervals start or finish at the very first or last base of a scaffold

    because it is a BED file, you end up trying to retrieve coordinates beyond the edge of the scaffold, which is bad.

    As for the interval padding, I think that because some of my intervals start or finish at the very first or last base of a scaffold, it simply wasn't possible to include flanking sequence

    That's what I figured. I'll look into whether we can adapt the handling of the padding so that GATK does not freak out when this happens.
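
Until padding near scaffold edges is handled more gracefully, one workaround is to apply the padding to the BED file beforehand and clamp it to the scaffold boundaries, then run without the padding argument. The sketch below is a pre-processing idea only and makes no claim about how GATK implements padding internally. The function name is hypothetical, and the contig length is assumed to come from the reference index or sequence dictionary.

```python
from typing import Tuple


def pad_bed_interval(start: int, end: int, padding: int, contig_length: int) -> Tuple[int, int]:
    """Pad a 0-based, half-open BED interval, clamped to [0, contig_length]."""
    if padding < 0:
        raise ValueError("padding must not be negative")
    padded_start = max(0, start - padding)            # never before the first base
    padded_end = min(contig_length, end + padding)    # never past the last base
    return padded_start, padded_end


# An interval touching both ends of a short, made-up scaffold of length 110.
print(pad_bed_interval(5, 100, 20, 110))
```

Padding adjacent intervals this way can make them overlap, so it may be worth merging the padded intervals afterwards.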

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Prolix Thanks for reporting back, that sounds like it might explain your issue. If you are able to verify this, you might want to drop your IT support team a line to let them know that the different java versions may be causing problems.

  • XBonXBon SwitzerlandMember

    Hi,
    I had a java.lang.NullPointerException error as well. I am quite new to this and I think my problem might not be the same as andykt's.
    I have two human genomes from related individuals that were sequenced, pre-processed and run through HaplotypeCaller at the same time. I did not have any problems completing HaplotypeCaller for one of them, but for the other, I got this stack trace:

    ERROR stack trace

    java.lang.NullPointerException
    at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:51)
    at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:41)
    at java.util.TimSort.countRunAndMakeAscending(Unknown Source)
    at java.util.TimSort.sort(Unknown Source)
    at java.util.TimSort.sort(Unknown Source)
    at java.util.Arrays.sort(Unknown Source)
    at java.util.Collections.sort(Unknown Source)
    at org.broadinstitute.sting.utils.sam.ReadUtils.sortReadsByCoordinate(ReadUtils.java:320)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.finalizeActiveRegion(HaplotypeCaller.java:1111)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.assembleReads(HaplotypeCaller.java:949)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:825)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

    I cannot identify what the problem is. I would appreciate any clues!
    Thank you in advance!

  • XBonXBon SwitzerlandMember

    I forgot to mention that since it is whole genome I have no intervals and I use Java 1.8, but it worked completely fine for one sample and I have already launched this one twice, so this might not be the issue?

    Tnx!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @XBon‌

    Hi,

    Please try using Java version 1.7.

    You can refer here for more information: http://www.broadinstitute.org/gatk/guide/article?id=2899

    -Sheila

  • XBonXBon SwitzerlandMember

    Hi Sheila,
    Thank you for your answer. I have the same problem with java version "1.7.0_51".
    Do you have any advice on what to try next?
    Thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @XBon‌

    Hi,

    Are you using -nct? It looks like a concurrency issue.

    -Sheila

  • XBonXBon SwitzerlandMember

    Yes, I am. I'm running it now without it. Should this fix the problem?

    X

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited June 2014

    @XBon‌

    Hi,

    I think it should work fine without -nct. Users have reported problems when using it.

    -Sheila

  • XBonXBon SwitzerlandMember

    Thank you very much Sheila! I'll report back when it's done running (sadly, it is a high coverage whole genome, so it will be several days)
