Exception java.lang.NullPointerException when running multithreaded VariantFiltration

Running the multithreaded command below causes an error, although the single-threaded version works fine.

$ java -jar $GATK -T VariantFiltration -R human_g1k_v37.fasta -o chrom01_subset_biallelic_filtered.vcf --variant chrom02_subset_biallelic.vcf.gz --filterExpression "AF > 0.02" --filterName "MAFfilter" --num_threads 2

Is this a bug, or am I doing something wrong?

If you want to see the VCF data that causes the problem, I can supply it, but it's not exactly open source data so it requires some discretion.

The reference FASTA can be downloaded here, and the index and dictionary files are generated via:

$ samtools faidx human_g1k_v37.fasta
$ java -jar $PICARD CreateSequenceDictionary R=human_g1k_v37.fasta O=human_g1k_v37.dict

The error is pasted in here:

INFO 14:06:35,630 HelpFormatter - ----------------------------------------------------------------------------------
INFO 14:06:35,632 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 14:06:35,632 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 14:06:35,633 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 14:06:35,633 HelpFormatter - [Mon May 01 14:06:35 CEST 2017] Executing on Linux 2.6.32-642.1.1.el6.x86_64 amd64
INFO 14:06:35,633 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26
INFO 14:06:35,636 HelpFormatter - Program Args: -T VariantFiltration -R human_g1k_v37.fasta -o chrom01_subset_biallelic_filtered.vcf --variant chrom02_subset_biallelic.vcf.gz --filterExpression AF > 0.02 --filterName MAFfilter --num_threads 2
INFO 14:06:35,639 HelpFormatter - Executing as [email protected] on Linux 2.6.32-642.1.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26.
INFO 14:06:35,640 HelpFormatter - Date/Time: 2017/05/01 14:06:35
INFO 14:06:35,640 HelpFormatter - ----------------------------------------------------------------------------------
INFO 14:06:35,640 HelpFormatter - ----------------------------------------------------------------------------------
INFO 14:06:35,691 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:06:35,781 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
WARN 14:06:35,839 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 14:06:35,847 MicroScheduler - Running the GATK in parallel mode with 2 total threads, 1 CPU thread(s) for each of 2 data thread(s), of 16 processors available on this machine
INFO 14:06:35,946 GenomeAnalysisEngine - Preparing for traversal
INFO 14:06:35,951 GenomeAnalysisEngine - Done preparing for traversal
INFO 14:06:35,952 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 14:06:35,952 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 14:06:35,952 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

ERROR --
ERROR stack trace

java.lang.NullPointerException
at java.util.LinkedList.node(LinkedList.java:577)
at java.util.LinkedList.get(LinkedList.java:477)
at org.broadinstitute.gatk.tools.walkers.filters.FiltrationContextWindow.getContext(FiltrationContextWindow.java:66)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.filter(VariantFiltration.java:367)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:318)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:99)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Issue · GitHub, by shlee. Issue number: 2027. State: closed. Closed by: vdauwera.
Answers

  • shlee (Cambridge; Member, Broadie)
    edited May 2017

    Hi @olavur,

    VariantFiltration is supposed to be able to parallelize using the -nt option. Given that the identical command runs without multithreading but fails with it, the error you see is likely a bug.

    The reference FASTA you provide the link to is actually from our Resource Bundle. There is no need for you to generate the index and dictionary files as these are also provided in the Resource Bundle. Let's rule out that something went funny with your independent index and dictionary generation. Can you download the index and dictionary we provide and try your threaded command again?

    If it still errors, could you please fill out a bug report? The link to the instructions is in the left menu; they have you generate a small data snippet that recapitulates the error. If this is human subject data with restricted sharing, then DO NOT UPLOAD THE DATA. If the data is on GDC, you can point me to the file, as I have access to certain restricted data. Thanks.

  • olavur (Member)

    @shlee My problem seems to be fully dependent on the data. I've tried running the command on a different dataset and it works, and I've tried running it on a subset of the variants in the problem-causing data, which works as well. This also means that it is difficult to make a small, less sensitive dataset that still causes the problem. Does it make sense to submit a bug report on this problem with no data?

  • olavur (Member)

    @shlee Another thing. As described above, the command works when running the command on a subset of the SNPs (the first 4 in the file). So is there a way for me to find out which SNP(s) is causing the error (if that is the case)?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    This is a thread safety bug that has been fixed in development. You can use the nightly build to get past it.
  • shlee (Cambridge; Member, Broadie)

    Thanks Geraldine. @olavur, you can find the nightly builds here.

  • olavur (Member)
    edited May 2017

    @shlee @Geraldine_VdAuwera I tried the nightly build (2017-05-03-g5c85575), and got the same error.

  • shlee (Cambridge; Member, Broadie)

    @olavur. Geraldine is away on vacation so let's try to suss this out together. First, I'm not familiar with the bug fix (I've been away for most of April) so I'll have to ask someone about this particular aspect of our conversation. In the meanwhile, if you want to try to narrow down the offending records, there are multiple ways you could approach this, e.g.:

    1. Split the file in half, then keep halving whichever portion still causes the error.
    2. Use the -L parameter to check each contig's worth of variants.
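
The halving approach in item 1 could be scripted along these lines. This is a minimal sketch, assuming an uncompressed VCF as input; the helper name split_vcf_halves and the output file names are hypothetical and not part of GATK. Each half keeps the full header so that it remains a usable VCF on its own.

```shell
# Hypothetical helper (not a GATK tool): split the records of an
# uncompressed VCF into two files, copying the header into both.
split_vcf_halves() {
    local vcf="$1"
    local prefix="$2"
    local total half
    # Count data records, i.e. lines that do not start with '#'.
    total=$(grep -vc '^#' "$vcf")
    # The first half gets the extra record when the count is odd.
    half=$(( (total + 1) / 2 ))
    awk -v half="$half" \
        -v first="${prefix}.first.vcf" \
        -v second="${prefix}.second.vcf" '
        /^#/ { print > (first); print > (second); next }
        { n++; if (n <= half) print > (first); else print > (second) }
    ' "$vcf"
}
```

Each half can then be passed to the failing command via --variant in place of the original file; whichever half still fails is split again.
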
  • shlee (Cambridge; Member, Broadie)

    After talking to @Sheila, it occurs to me that more recent VCF features, e.g. representations of spanning deletions or ambiguous SVs, may not be amenable to current multithreading approaches. If this or some similar new feature of variant records that depend on other variant records applies to your variant data, then our advice to you is to just not multithread. Alternatively, you could simplify your VCF to remove these features. If you have the time, it would be really great to narrow down what is actually causing the problem.

  • olavur (Member)

    I tried using -L to find what variants are causing problems, but it now seems more like it is the number of variants analyzed that is causing problems.

    Running the command with -L 2:1-n and just increasing n, the command eventually crashes (specifically, when n goes from 1e6 to 2e6). However, when trying to narrow it down, like -L 2:n1-n2 and finding the range n1-n2, the problem disappears when the region gets too small; for example, -L 2:1000000-10000000 causes an error, but chopping that region up into intervals of size 1e6 does not give an error.

    Long story short, the error seems to occur when too many variants are analyzed. How can this be?
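
The interval experiment described above could be scripted roughly as follows. This is a minimal sketch; make_windows is a hypothetical helper, not a GATK feature, and it only prints consecutive fixed-size intervals in the contig:start-end form that -L accepts.

```shell
# Hypothetical helper: print consecutive windows of a given size
# covering contig:start-end, one -L style interval per line.
make_windows() {
    local contig="$1"
    local start="$2"
    local end="$3"
    local size="$4"
    local lo="$start"
    local hi
    while [ "$lo" -le "$end" ]; do
        hi=$(( lo + size - 1 ))
        # Clip the last window to the requested end coordinate.
        if [ "$hi" -gt "$end" ]; then
            hi="$end"
        fi
        echo "${contig}:${lo}-${hi}"
        lo=$(( lo + size ))
    done
}
```

For example, the output of make_windows 2 1000000 10000000 1000000 can be fed one line at a time to -L in the VariantFiltration command from the original post, to compare the per-window runs against the single large interval.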

  • shlee (Cambridge; Member, Broadie)

    @olavur. So here is what we know so far:

    1. VariantFiltration without -nt threading on your VCF works fine.
    2. VariantFiltration with -nt threading gives a java.lang.NullPointerException.
    3. VariantFiltration with -nt threading on the first four records of the VCF works fine.
    4. VariantFiltration with -nt threading and -L on small genomic intervals works fine.
    5. VariantFiltration with -nt threading and -L on larger genomic intervals errors (presumably the same error as (2)).

    Just out of curiosity, how many samples does your VCF contain and what is chrom02_subset_biallelic.vcf.gz's file size?

  • olavur (Member)

    @shlee Yes, this is all correct.

    The size of chrom02_subset_biallelic.vcf.gz is 43MB, it contains 2594192 variants and 9 individuals (it is sort of a debugging test set).

  • shlee (Cambridge; Member, Broadie)

    Hmm. Given the small file size and number of samples, memory should not be an issue. I'm going to have to consult one of our developers. In the meanwhile, if you don't mind, perhaps we can rule out individual samples from causing the error. For example, select out one sample (or other number of samples) using SelectVariants -sn and then see if the VariantFiltration -nt command still fails.
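
The per-sample check suggested above needs the sample names first. A minimal sketch, assuming a standard VCF header in which sample names start at column 10 of the #CHROM line; vcf_samples is a hypothetical helper and the per-sample output file names in the commented loop are placeholders.

```shell
# Hypothetical helper: read a VCF on stdin and print one sample name
# per line. Columns 1-9 of the #CHROM line are fixed VCF fields.
vcf_samples() {
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Possible use with the files from this thread (not run here):
#   for s in $(gzip -dc chrom02_subset_biallelic.vcf.gz | vcf_samples); do
#       java -jar "$GATK" -T SelectVariants -R human_g1k_v37.fasta \
#           -V chrom02_subset_biallelic.vcf.gz -sn "$s" -o "only_${s}.vcf"
#   done
```

Each single-sample VCF can then be run through the same VariantFiltration -nt command to see whether the failure persists.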

  • shlee (Cambridge; Member, Broadie)

    @olavur,

    Can you again check if your VCF file contains spanning deletions? Let's definitively rule this out.

  • olavur (Member)

    @shlee Didn't I rule out the possibility that individual samples were causing the error by using the -L flag?

    I didn't find the * overlapping deletion marker in my VCF file. But this doesn't rule out that there is a spanning deletion in the "alternate" representation (as shown in the table in the article you linked). Does this answer your question?
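
A check for the * marker like the one described above could look as follows. This is a minimal sketch, assuming the marker appears as its own allele in the ALT column (column 5); count_star_alts is a hypothetical helper. As noted in the post, it does not detect spanning deletions written in the alternate representation.

```shell
# Hypothetical helper: read a VCF on stdin and print the number of
# records whose ALT column lists '*' as one of its alleles.
count_star_alts() {
    awk -F'\t' '
        /^#/ { next }    # skip header lines
        {
            n = split($5, alts, ",")
            for (i = 1; i <= n; i++) {
                if (alts[i] == "*") { hits++; break }
            }
        }
        END { print hits + 0 }
    '
}
```

A compressed file can be piped in, for example with gzip -dc chrom02_subset_biallelic.vcf.gz | count_star_alts; a result of 0 means no record uses the marker.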

  • shlee (Cambridge; Member, Broadie)

    @olavur. The -L flag targets specified genomic intervals. The alternate representation that avoids using the placeholder asterisk * should not cause the problems that the overlapping deletion marker would. We can then rule out spanning deletions as a source of error.

  • olavur (Member)

    I used SelectVariants to run the command on individuals (samples) independently and got the same error for each individual.

  • shlee (Cambridge; Member, Broadie)

    @olavur. We are really curious what is going on with your error. We should try to find a way for those on our team to view your data. Are you familiar with FireCloud? Or by any chance is your data available within the GDC?

  • olavur (Member)

    @shlee The data is not available on GDC.

    I'm not familiar with FireCloud. Can I use it to grant secure access to you (or other GATK developers)? If so, I will have to check with some of my superiors that it is ok.

  • shlee (Cambridge; Member, Broadie)

    Yes, @olavur, FireCloud is set up for secure data access.

  • olavur (Member)

    @shlee

    According to the documentation:

    For a gentle introduction on how to use FireCloud, see the FireCloud Quickstart Guide (coming soon).

    I find it difficult to get started without such a guide. Any tips?

  • shlee (Cambridge; Member, Broadie)
    edited May 2017

    @olavur,

    I'm told the timeline for the Quickstart Guide is on the order of some weeks. Are you possibly local to Boston? We have a dedicated FireCloud office hour where someone can sit down with you and walk you through what you need to know. If you aren't local, it is also possible to call in to the office hour via Google Hangouts. It is every Wednesday, 2–3 PM Eastern Time (Boston, MA). I can help set this up for you if you'd like.

  • shlee (Cambridge; Member, Broadie)

    Hey @olavur, I've talked to @Geraldine_VdAuwera about our conversations and we'd like to point out that in GATK4 there will no longer be multithreading. GATK4 is faster all around and so your best bet for time savings is to use GATK4 when it is released. Please stay tuned for when this will be.
