"Could not open array genomicsdb_array at workspace" from GenotypeGVCFs in GATK 4.0.0.0

I am experiencing issues with GenotypeGVCFs and GenomicsDB input in the final GATK4 release, using the official Docker image. This does not occur with the 4.beta.6 release. An earlier beta release apparently had a bug with the same error message, which was fixed. Is my issue related to that old bug, or does it just produce the same error message? What can I do to debug the issue?
2018-01-10T12:15:04.154516155Z terminate called after throwing an instance of 'VariantQueryProcessorException'
2018-01-10T12:15:04.154547266Z what(): VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db
2018-01-10T12:15:04.154561314Z
2018-01-10T12:15:04.620517615Z Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
2018-01-10T12:15:04.620517615Z Running:
2018-01-10T12:15:04.620517615Z /gatk/build/install/gatk/bin/gatk GenotypeGVCFs -V gendb:///keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db --output chr22_db.vcf --reference /keep/db91e5f04cbd9018e42708316c28e82d+2160/hg19.fa
Post edited by moritzgilsdorf on
Answers
Can you try with just gendb://, e.g. gendb://keep?
I can use GenotypeGVCFs this way without any issues.
This doesn't work either.
@moritzgilsdorf
It looks like you have three slashes in your path instead of two:
should be
@Geraldine_VdAuwera
That is not the problem. The absolute path for the genomicsdb is
If I remove one slash the workflow tries to access it relative to the working directory.
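To illustrate the slash counting discussed above, here is a minimal sketch. It assumes, based only on the behaviour reported in this post, that whatever follows `gendb://` is treated as a filesystem path, so an absolute path produces three slashes and a relative path produces two. The function name `gendb_uri` is hypothetical and not part of GATK.

```shell
# Hypothetical helper: build a gendb:// URI from a workspace path.
# An absolute path keeps its leading slash, giving "gendb:///...".
gendb_uri() {
    printf 'gendb://%s\n' "$1"
}

# Absolute workspace path (three slashes in the result):
gendb_uri /keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db

# Relative workspace path (two slashes, resolved against the working directory):
gendb_uri test_21
```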
Other evidence that the call itself is not wrong:
1. The same execution works when using the 4.beta.6 docker
2. If I point the gendb to an existing but invalid directory (not containing a genomicsdb) I see something like this:
So the error in my initial post seems to indicate that the GenomicsDB could be opened partially (vidmap.json, callset.json and vcfheader.vcf were found) but that there is some issue with the genomicsdb_array subdirectory.
I did some additional tests and believe this issue is related to the fact that in my setup the GenomicsDB is read-only. It works if the whole directory and all contained files are writable. If they are not writable, it crashes with the error message mentioned above. I will try to work around this somehow, which will be a hack, as I'm using Arvados with CWL and the input of this workflow step is read-only by default.
I'm wondering why the genomics_db is required to be writable. I checked the timestamps for the files in the genomics_db directory structure after a successful execution directly using Docker and it looks like there is no modification anywhere. Therefore reading only should not cause an error, in my opinion.
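One possible workaround for the read-only case, as a rough sketch rather than a confirmed fix: copy the workspace to a writable scratch location and point GenotypeGVCFs at the copy. The function name `stage_writable_copy` and the scratch location are hypothetical. The sketch assumes the workspace is small enough to copy and that `gatk` is on the PATH.

```shell
# Hypothetical helper (not part of GATK): copy a read-only GenomicsDB
# workspace into a writable scratch directory and print the copy's path.
stage_writable_copy() {
    local src="$1"                         # read-only workspace directory
    local scratch="${2:-${TMPDIR:-/tmp}}"  # assumed writable scratch location
    local dest
    dest="$(mktemp -d "${scratch%/}/genomicsdb_copy.XXXXXX")" || return 1
    # Copy the contents (including hidden files), then make the copy writable.
    cp -R "$src/." "$dest/" || return 1
    chmod -R u+w "$dest" || return 1
    printf '%s\n' "$dest"
}

# Usage, with the paths from the first post:
#   ws="$(stage_writable_copy /keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db)"
#   gatk GenotypeGVCFs -V "gendb://$ws" --output chr22_db.vcf \
#       --reference /keep/db91e5f04cbd9018e42708316c28e82d+2160/hg19.fa
```

Because `mktemp -d` returns an absolute path, the resulting URI has three slashes, matching the absolute-path form used earlier in this thread.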
Oh that’s interesting — we’ll check with the dev team.
Hi Geraldine, I am also facing a similar problem.
16:49:21.958 INFO GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
16:49:21.958 INFO GenotypeGVCFs - Initializing engine
terminate called after throwing an instance of 'VariantQueryProcessorException'
what(): VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /seq/tmp/gatk4_test/test_21
I tried chmod 777 on the test_21 directory (i.e. the GenomicsDB workspace) and resubmitted the job, but I still get the same error as above.
My command line is below:
/GenomeAnalysisTK-4.0/gatk-4.0.0.0/gatk GenotypeGVCFs -R /mnt/gatk-bundle/2.8/b37/human_g1k_v37_decoy.fasta -V gendb://test_21 -O test_output.vcf -L 21
Hello,
Has anybody figured out a solution for this error? I'm facing a similar problem using GATK:4.0.1.0 docker image on a local server.
I have tried changing the directory permissions to 755 and the file permissions to 655, but it does not make any difference. I get the following error:
Thanks,
Yatros
@Lavanya @Yatros
Hi,
Perhaps this issue will have some helpful hints. @Lavanya I am not sure, but perhaps this thread may help. It looks like Yatros is already using small intervals.
Let me check with the team on other solutions and get back to you.
-Sheila
I'm having the same problem using v4.0.0. A rerun sometimes can fix the problem. The reason remains unknown.
Hi @Sheila ,
I don't think the size of the intervals is an issue.
I tried to combine 20 gVCF files from a sequencing panel containing 48 genes, using one interval per sequenced exon. I had combined these same files successfully before with the GATK4 beta.6 version. Using the same methodology and commands (GenomicsDBImport + GenotypeGVCFs), I cannot combine these 20 samples anymore with the GATK v4.0.1.0 / v4.0.1.2 releases.
When I compare the output of both runs for the first interval of these 20 samples, these are the differences I can see regarding the files that are generated in each folder using each version:
GATK4.beta.6 generated the following files in the genomicsdb folder:
- __tiledb_workspace.tdb - 0 B
- callset.json - 1,846 KB
- vidmap.json - 8,264 KB
GATK4.beta.6 generated the following files in the genomicsdb_array subfolder:
- __array_schema.tdb - 631 B
- genomicsdb_meta.json - 48 B
GATK4.0.1.0 generates the following files in the genomicsdb folder:
- __tiledb_workspace.tdb - 0 B
- callset.json - 1,846 KB
- vcfheader.vcf - 16,719 KB (Novel file)
- vidmap.json - 9,132 KB
GATK4.0.1.0 generates the following files in the genomicsdb_array subfolder:
- .__consolidation_lock - 0 B (Novel file)
- __array_schema.tdb - 587 B
- genomicsdb_meta.json - 48 B
It looks as if one of the newly generated files is not recognized correctly by the GenotypeGVCFs command in the current GATK4 version.
One thing that caught my attention is that the generated vcfheader.vcf file does not contain any sample names. I'm not sure whether they should be added to the vcfheader file in this step or later on, when the actual genotypes are merged.
I can send you the files from both runs if you need them.
Thanks for your help,
Best,
Oswaldo
@Sheila
Hi Sheila,
Thanks for your response.
The earlier testing failed on Lustre FS.
When we try running "GenotypeGVCFs" on GPFS it seems to work.
Could it be because of file locking mechanisms? Could you please confirm? Thanks.
@jianxinwang @Yatros @Lavanya
Hi everyone,
Thanks for the information. I will pass this on to the developers and get back to you soon.
-Sheila
@jianxinwang @Yatros @Lavanya @md1jale
Hi again,
While I am waiting to hear back from the developer, can someone confirm this happens with the very latest version of GATK4?
Thanks,
Sheila
@Sheila , I am testing with gatk-4.0.0.0 version
Can you test it with GATK 4.0.2.1 ?
Hello,
I have tested the GenotypeGVCFs command in the GATK 4.0.2.1 version with a single interval multiple times (each time using a different interval) and I get exactly the same error all the time:
Before the final release of the GATK 4.0.0.0, I was able to merge GenomicsDBImport output files with the GATK4.beta6 version, but I'm not able to output a single vcf file with the beta version anymore.
I have been stuck at this step for several weeks now. Is there any way of getting around it?
Thank you very much,
Best,
Yatros
Hello,
I don't know if this can help somebody. Just for your records.
I was using the GATK4 Docker container on a CentOS 7.3 machine with an XFS filesystem and I was getting the "Could not open array genomicsdb_array at workspace:" error all the time. My OS volume was 22 GB, probably too small for Docker and containers.
Now I am using the GATK4 Docker container on an updated Ubuntu 16.04 machine: a standard server installation (not desktop) with an XFS filesystem. My main volume is 80 GB. The VM has 16 cores on a dual-CPU AMD 6276 system. The error does not pop up anymore.
The OS volume space may have been my main problem.
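For anyone who wants to rule out disk space before a run, here is a minimal sketch. The helper names `free_gib` and `check_free_space` are hypothetical, and the 20 GiB default threshold is an arbitrary illustration, not a documented GATK or Docker requirement.

```shell
# Hypothetical helper: print free space (whole GiB, rounded down) on the
# filesystem that holds the given path, using POSIX-format df output.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Warn if a path has less free space than a chosen threshold.
# Returns 0 if enough space, 1 if too little, 2 if df gave no usable output.
check_free_space() {
    local path="$1" min_gib="${2:-20}" free
    free="$(free_gib "$path")"
    [ -n "$free" ] || return 2
    if [ "$free" -lt "$min_gib" ]; then
        echo "WARNING: only ${free} GiB free on $path (wanted >= ${min_gib})" >&2
        return 1
    fi
    echo "OK: ${free} GiB free on $path"
}

# Usage: check the Docker data root (its default location on Linux) and the
# directory that holds the GenomicsDB workspace:
#   check_free_space /var/lib/docker 20
#   check_free_space /path/to/workspace/parent 20
```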
Best,
Yatros
@Yatros
Hi Yatros,
Thank you for sharing! @jianxinwang @Lavanya @md1jale Can you all see if this helps?
Thanks,
Sheila
I am running this on the cluster (CentOS Linux release 7.4.1708 (Core)) with :
Filesystem Size Used Avail Use% Mounted on
[email protected]:[email protected]:/lustre 669T 133T 502T 21% /mnt/fastdata
Doesn't seem like a lack of space issue
Hi @md1jale et al.,
We're sorry you've had difficulties using GenomicsDB. One thing that might help is a script in our new gatk-workflows repository that covers this step, called gatk4-germline-snps-indels. This repository's WDL scripts outline the commands and the versions of the Docker images (containing specific versions of programs, e.g. GATK4.0.0.0) that have been verified to work. It appears that there are two versions of scripts that cover GenomicsDB:
The fc label means the script is for FireCloud. Note that the matched Docker versions are specified in the accompanying JSON input files.
Hi,
I'm having the same issue when I run GenotypeGVCFs with the latest GATK version on a Lustre filesystem. The thing is that when I run it on a Lustre FS without file locking activated I get the error, but when I run it on other filesystems I don't. Is this a known issue? Thanks!
@cluengo
Hi,
I don't know if this is a known issue. I have not seen other reports of this. However, have a look at this thread. Perhaps some other users can jump in who use lustre. Also, "when I run it on a Lustre FS without file locking activated I get the error". What do you mean by this? I saw we used to have a flag to stop file locking in GATK3, but it does not seem to exist in GATK4.
Thanks,
Sheila
Hi @Sheila,
Thanks for the reply. What I meant is that the Lustre FS we use doesn't have file locking activated, so locking can't be done. I don't know if GATK4 always tries to lock by default, but when I run the command writing to an NFS filesystem it does work... that's why we suspected file locking.
Thanks,
Cristina.
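A rough way to probe whether a given filesystem accepts file locks at all, as a sketch under stated assumptions: it uses the `flock` utility from util-linux, which takes a lock via flock(2). That may not be the same locking call that TileDB uses internally, so a pass here does not guarantee that GenomicsDB will work, but a failure is a strong hint. The function name `can_lock` is hypothetical.

```shell
# Hypothetical probe: try to take a non-blocking exclusive lock on a scratch
# file inside the given directory.
# Returns 0 if the lock succeeded, 1 if it failed, 2 if flock is not installed.
can_lock() {
    local dir="$1" probe rc
    command -v flock >/dev/null 2>&1 || return 2
    probe="$(mktemp "${dir%/}/.lock_probe.XXXXXX")" || return 1
    if flock -n "$probe" true; then
        rc=0
    else
        rc=1
    fi
    rm -f "$probe"
    return "$rc"
}

# Usage: point it at a writable directory on the same filesystem as the workspace.
#   if can_lock /mnt/fastdata/some_scratch_dir; then
#       echo "file locking appears to work here"
#   else
#       echo "file locking failed or flock is unavailable"
#   fi
```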
@cluengo
Hi Cristina,
I am checking with the team if there is anything they can help with. I am hoping someone else who uses Lustre can jump in as well.
-Sheila
Thanks @Sheila !
Hi @cluengo and others,
Sorry you're having problems with this. It sounds like this thread is describing probably several underlying issues, all of which are presenting with very vague and insufficient error messages.
I've opened a bug report to track this and asked for help from the GenomicsDB developers who will have much more insight into this.
Hopefully we can resolve the problems and allow it to run on a lustre system, although I don't know if we have access to one so testing it may be tricky. If for some reason file locking is absolutely required to run genomicsDB, then at a minimum we should improve the error messages so it's clear what's wrong.
Louis
Hi, I got this error with version 4.0.0.0 and updated to the newest version (4.0.8.1), but that didn't solve the problem. Any ideas what might help at this stage?
gatk GenotypeGVCFs -R /home/mass/GRD/ndsaraujo/damona_p/nobackup/working_GC/bosTau6.fasta -V gendb://genomicsdb_chr1_2_2018-09-03 --output chr1_MainPed_jcall_2018-09-07.vcf
Using GATK jar /home/mass/ifilesets/ULG/u227610/programs/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/mass/ifilesets/ULG/u227610/programs/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar GenotypeGVCFs -R /home/mass/GRD/ndsaraujo/damona_p/nobackup/working_GC/bosTau6.fasta -V gendb://genomicsdb_chr1_2_2018-09-03 --output chr1_MainPed_jcall_2018-09-07.vcf
12:31:49.395 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mass/ifilesets/ULG/u227610/programs/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:31:49.537 INFO GenotypeGVCFs - ------------------------------------------------------------
12:31:49.537 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.8.1
12:31:49.537 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
12:31:49.537 INFO GenotypeGVCFs - Executing as [email protected]aster01 on Linux v3.10.0-327.4.5.el7.x86_64 amd64
12:31:49.537 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
12:31:49.538 INFO GenotypeGVCFs - Start Date/Time: September 12, 2018 12:31:49 PM CEST
12:31:49.538 INFO GenotypeGVCFs - ------------------------------------------------------------
12:31:49.538 INFO GenotypeGVCFs - ------------------------------------------------------------
12:31:49.538 INFO GenotypeGVCFs - HTSJDK Version: 2.16.0
12:31:49.538 INFO GenotypeGVCFs - Picard Version: 2.18.7
12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:31:49.538 INFO GenotypeGVCFs - Deflater: IntelDeflater
12:31:49.538 INFO GenotypeGVCFs - Inflater: IntelInflater
12:31:49.539 INFO GenotypeGVCFs - GCS max retries/reopens: 20
12:31:49.539 INFO GenotypeGVCFs - Using google-cloud-java fork https://github.com/broadinstitute/google-cloud-java/releases/tag/0.20.5-alpha-GCS-RETRY-FIX
12:31:49.539 INFO GenotypeGVCFs - Initializing engine
terminate called after throwing an instance of 'VariantQueryProcessorException'
what(): VariantQueryProcessorException : Could not open array chr1$1$158337067 at workspace: /home/mass/ifilesets/URT/UGU_DAM/nobackup/working_GC/GVCFS/genomicsdb_chr1_2_2018-09-03
TileDB error message : [TileDB::StorageManager] Error: Cannot lock consolidation filelock; Cannot lock
I've noticed that GenomicsDBImport generates different files for the different versions:
version 4.0.0.0:
__tiledb_workspace.tdb
version 4.0.8.1:
callset.json chr1$1$158337067 __tiledb_workspace.tdb vcfheader.vcf vidmap.json
This slightly changes the error message (this is for version 4.0.0.0):
A USER ERROR has occurred: Couldn't connect to GenomicsDB because the vidmap, callset JSON files, or gVCF Header (vidmap.json,callset.json,vcfheader.vcf) could not be read from GenomicsDB workspace /home/mass/ifilesets/URT/UGU_DAM/nobackup/working_GC/GVCFS/genomicsdb_chr1_2018-09-03
But I'm still not able to run the GenotypeGVCFs.
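The 4.0.0.0 error message above names three metadata files that must be readable in the workspace. A quick way to see which of them are missing, as a minimal sketch (the function name `missing_workspace_files` is hypothetical, and the file list is taken only from that error message):

```shell
# Hypothetical helper: report which of the metadata files named in the
# GATK 4.0.0.0 error message are missing or unreadable in a workspace.
# Prints one file name per line; returns 0 only if all three are readable.
missing_workspace_files() {
    local ws="$1" f status=0
    for f in vidmap.json callset.json vcfheader.vcf; do
        if [ ! -r "$ws/$f" ]; then
            printf '%s\n' "$f"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   missing_workspace_files genomicsdb_chr1_2_2018-09-03 \
#       && echo "all metadata files are readable"
```

This only checks for the presence of the files. It says nothing about whether the array directory itself can be opened or locked.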
Hi @nast2bee,
Can you check to see if SelectVariants works on your database? Here's an example command from one of our workshop tutorials:
If this works, we can look further into GenotypeGVCFs. Otherwise, we will have to look at your import command. Towards this, we would ask for some test data. You can follow instructions at https://software.broadinstitute.org/gatk/guide/article?id=1894 towards the latter.
Hi,
I am using version 4.0.11.0 and encountering the same issue as nast2bee. I have tried both GenotypeGVCFs and SelectVariants and I get the same error message.
My input file was generated using GenomicsDBImport w/ multi-intervals
``
Using GATK jar /usr/local/apps/eb/GATK/4.0.11.0-foss-2016b-Python-2.7.14/gatk-package-4.0.11.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -jar /usr/local/apps/eb/GATK/4.0.11.0-foss-2016b-Python-2.7.14/gatk-package-4.0.11.0-local.jar SelectVariants -R A1163.fa -V gendb://GVCF_S5.dir -O GVCF_S5_combined.vcf
22:40:35.183 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/apps/eb/GATK/4.0.11.0-foss-2016b-Python-2.7.14/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
22:40:37.000 INFO SelectVariants - ------------------------------------------------------------
22:40:37.000 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.0.11.0
22:40:37.000 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
22:40:37.001 INFO SelectVariants - Executing as [email protected] on Linux v3.10.0-229.20.1.el7.x86_64 amd64
22:40:37.001 INFO SelectVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_144-b01
22:40:37.001 INFO SelectVariants - Start Date/Time: November 15, 2018 10:40:35 PM EST
22:40:37.001 INFO SelectVariants - ------------------------------------------------------------
22:40:37.001 INFO SelectVariants - ------------------------------------------------------------
22:40:37.001 INFO SelectVariants - HTSJDK Version: 2.16.1
22:40:37.001 INFO SelectVariants - Picard Version: 2.18.13
22:40:37.001 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:40:37.002 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:40:37.002 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:40:37.002 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:40:37.002 INFO SelectVariants - Deflater: IntelDeflater
22:40:37.002 INFO SelectVariants - Inflater: IntelInflater
22:40:37.002 INFO SelectVariants - GCS max retries/reopens: 20
22:40:37.002 INFO SelectVariants - Requester pays: disabled
22:40:37.002 INFO SelectVariants - Initializing engine
terminate called after throwing an instance of 'VariantQueryProcessorException'
what(): VariantQueryProcessorException : Could not open array DS499644$1$1161 at workspace: /lustre1/sek53827/GVCF_S5.dir
TileDB error message : [TileDB::StorageManager] Error: Cannot lock consolidation filelock; Cannot lock
``
Hi @sekang2
Sorry I missed this question. Would you please post the exact command you are using? I will look into this issue for you.
Also, were you able to duplicate this issue?
Regards
Bhanu
Hi @sekang2
We have not heard from you in 2 business days, so we will be closing this issue now.
Regards
Bhanu
Hi @sekang2
I encountered the same problem as you (my GATK version is 4.0.3.0; the file system is Lustre).
I have now solved the problem by using GATK 4.0.12.0 and setting an environment variable:
export TILEDB_DISABLE_FILE_LOCKING=1
I hope it helps you.
Regards
Weichi
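The workaround in the post above can also be applied to a single command instead of the whole shell session. A minimal sketch, assuming `gatk` is on the PATH; the wrapper name `without_tiledb_locking` is hypothetical. In this thread the variable is only reported to work with GATK 4.0.12.0, and whether older versions honor it is not confirmed here.

```shell
# Hypothetical wrapper: run one command with TileDB file locking disabled.
# The variable is set only in the environment of the wrapped command.
without_tiledb_locking() {
    env TILEDB_DISABLE_FILE_LOCKING=1 "$@"
}

# Usage, with the SelectVariants command posted earlier in this thread:
#   without_tiledb_locking gatk SelectVariants -R A1163.fa \
#       -V gendb://GVCF_S5.dir -O GVCF_S5_combined.vcf
```

Disabling locking removes a safety mechanism, so it seems prudent to make sure nothing else is writing to the workspace while it is being read.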
@Weichi
Thank you for the update.
That's great to hear, and thank you for the update! @nast2bee