"Could not open array genomicsdb_array at workspace" from GenotypeGVCFs in GATK 4.0.0.0

edited January 11 in Ask the GATK team

I am experiencing issues with GenotypeGVCFs and GenomicsDB input in the final GATK4 release using the official Docker image. This does not occur with the 4.beta.6 release. It looks like an earlier beta release had a bug with the same error message, which was fixed. Is my issue related to that old bug, or does it just result in the same error message? What can I do to debug the issue?

2018-01-10T12:15:04.154516155Z terminate called after throwing an instance of 'VariantQueryProcessorException'
2018-01-10T12:15:04.154547266Z   what():  VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db
2018-01-10T12:15:04.154561314Z 
2018-01-10T12:15:04.620517615Z Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
2018-01-10T12:15:04.620517615Z Running:
2018-01-10T12:15:04.620517615Z     /gatk/build/install/gatk/bin/gatk GenotypeGVCFs -V gendb:///keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db --output chr22_db.vcf --reference /keep/db91e5f04cbd9018e42708316c28e82d+2160/hg19.fa
Post edited by moritzgilsdorf on

Answers

  • SkyWarrior, Turkey (Member)

    Can you try with just gendb:// e.g. gendb://keep ?

    I can use GenotypeGVCFs this way without any issues.

  • This doesn't work either.

    2018-01-11T10:25:35.214044875Z [January 11, 2018 10:25:35 AM UTC] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.00 minutes.
    2018-01-11T10:25:35.214061939Z Runtime.totalMemory()=1530396672
    2018-01-11T10:25:35.214142072Z ***********************************************************************
    2018-01-11T10:25:35.214159746Z 
    2018-01-11T10:25:35.214159746Z 
    2018-01-11T10:25:35.214236978Z A USER ERROR has occurred: Couldn't connect to GenomicsDB because the vidmap, callset JSON files, or gVCF Header (vidmap.json,callset.json,vcfheader.vcf) could not be read from GenomicsDB workspace /keep/d22f668d4f44631d98bc650d582975ca+1399
    2018-01-11T10:25:35.214250274Z 
    2018-01-11T10:25:35.214250274Z 
    2018-01-11T10:25:35.214250274Z ***********************************************************************
    2018-01-11T10:25:35.214256605Z 
    2018-01-11T10:25:35.214406483Z Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
    2018-01-11T10:25:35.214417500Z 
    2018-01-11T10:25:35.233062029Z Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
    2018-01-11T10:25:35.233062029Z Running:
    2018-01-11T10:25:35.233062029Z     /gatk/build/install/gatk/bin/gatk GenotypeGVCFs -V gendb:///keep/d22f668d4f44631d98bc650d582975ca+1399 --output output.vcf --reference /keep/db91e5f04cbd9018e42708316c28e82d+2160/hg19.fa
    
  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @moritzgilsdorf

    It looks like you have three slashes in your path instead of two:

    -V gendb:///keep/d22f668d4f44631d98bc650d582975ca+1399
    

    should be

    -V gendb://keep/d22f668d4f44631d98bc650d582975ca+1399
    
  • @Geraldine_VdAuwera

    That is not the problem. The absolute path for the genomicsdb is

    e.g. /keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db
    

    If I remove one slash the workflow tries to access it relative to the working directory.
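The two-versus-three-slash behaviour described above can be illustrated with a small sketch. It assumes, as reported in this post, that whatever follows `gendb://` is treated as an ordinary filesystem path, so a third slash makes the path absolute and its absence makes it relative to the working directory. The function name `resolve_gendb_path` and the working directory `/home/user/run` are hypothetical; this is not GATK code.

```python
import posixpath

GENDB_PREFIX = "gendb://"


def resolve_gendb_path(uri, cwd):
    """Return the filesystem path a gendb URI would point to.

    Assumption (taken from the behaviour reported in this thread): the text
    after 'gendb://' is used as a plain path, so a leading '/' means an
    absolute path and anything else is resolved against the working directory.
    """
    if not uri.startswith(GENDB_PREFIX):
        raise ValueError("not a gendb URI: " + uri)
    path = uri[len(GENDB_PREFIX):]
    # posixpath.join discards cwd when path is already absolute.
    return posixpath.normpath(posixpath.join(cwd, path))


if __name__ == "__main__":
    workspace = "keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db"
    for uri in ("gendb:///" + workspace, "gendb://" + workspace):
        print(uri, "->", resolve_gendb_path(uri, "/home/user/run"))
```

Under that assumption, only the three-slash form resolves to the absolute workspace path.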

    Other evidence that the call itself is correct:
    1. The same execution works when using the 4.beta.6 Docker image.
    2. If I point gendb to an existing but invalid directory (one not containing a GenomicsDB workspace), I see something like this:

    A USER ERROR has occurred: Couldn't connect to GenomicsDB because the vidmap, callset JSON files, or gVCF Header (vidmap.json,callset.json,vcfheader.vcf) could not be read from GenomicsDB workspace /keep/d22f668d4f44631d98bc650d582975ca+1399
    

    Compared with this, the error in my initial post suggests that the GenomicsDB workspace could be opened partially (vidmap.json, callset.json and vcfheader.vcf were found) but that there is some issue with the genomicsdb_array subdirectory.
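The difference between "metadata files missing" and "array directory unusable" can be checked before running GATK. The following is a minimal sketch, not part of GATK: the three file names are taken from the USER ERROR message quoted in this thread, and the array directory name `genomicsdb_array` from the original crash message (later versions appear to name the array after the interval instead). `inspect_workspace` is a hypothetical helper name.

```python
import os

# File names come from the GATK error message quoted in this thread;
# the array directory name comes from the original crash message.
METADATA_FILES = ("vidmap.json", "callset.json", "vcfheader.vcf")
ARRAY_DIR = "genomicsdb_array"


def inspect_workspace(workspace):
    """Map each expected entry to True if it is present and readable."""
    report = {}
    for name in METADATA_FILES:
        path = os.path.join(workspace, name)
        report[name] = os.path.isfile(path) and os.access(path, os.R_OK)
    array_path = os.path.join(workspace, ARRAY_DIR)
    # A directory needs read and execute permission to be listed and entered.
    report[ARRAY_DIR] = os.path.isdir(array_path) and os.access(
        array_path, os.R_OK | os.X_OK
    )
    return report


if __name__ == "__main__":
    import sys

    for entry, ok in inspect_workspace(sys.argv[1]).items():
        print(("ok      " if ok else "MISSING ") + entry)
```

If all three metadata files are reported as present but the array directory is not, that would match the partial-open situation described above.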

  • I did some additional tests and believe this issue is related to the fact that in my setup the GenomicsDB workspace is read-only. It works if the whole directory and all contained files are writable. If they are not writable, it crashes with the error message mentioned above. I will try to work around this somehow, which will be a hack, as I'm using Arvados with CWL and the input of this workflow step is read-only by default.

    I'm wondering why the genomics_db is required to be writable. I checked the timestamps for the files in the genomics_db directory structure after a successful execution directly using Docker and it looks like there is no modification anywhere. Therefore reading only should not cause an error, in my opinion.
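One way to test the read-only hypothesis before launching GATK is to list every entry in the workspace that the current user cannot write to. The sketch below is a minimal diagnostic, not part of GATK, and `find_unwritable` is a hypothetical helper name. Note that `os.access` typically reports every path as writable when run as root (common inside Docker), and its answer on FUSE or network mounts depends on how the mount implements the check.

```python
import os


def find_unwritable(workspace):
    """Return paths under workspace that the current user cannot write to."""
    unwritable = []
    # os.walk visits every directory as `root`, so directories are covered too.
    for root, _dirs, files in os.walk(workspace):
        candidates = [root] + [os.path.join(root, name) for name in files]
        for path in candidates:
            if not os.access(path, os.W_OK):
                unwritable.append(path)
    return unwritable


if __name__ == "__main__":
    import sys

    for path in find_unwritable(sys.argv[1]):
        print("not writable:", path)
```

An empty result does not prove the workspace is usable, but a non-empty one points at the entries worth looking at first.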

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Oh that’s interesting — we’ll check with the dev team.

  • Lavanya (Member)
    edited February 23

    Hi Geraldine, I am also facing a similar problem.

    16:49:21.958 INFO GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    16:49:21.958 INFO GenotypeGVCFs - Initializing engine
    terminate called after throwing an instance of 'VariantQueryProcessorException'
    what(): VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /seq/tmp/gatk4_test/test_21

    I tried chmod 777 on the test_21 directory (i.e. the GenomicsDB workspace) and resubmitted the job, but I still get the same error as above.

    My command line is below:

    /GenomeAnalysisTK-4.0/gatk-4.0.0.0/gatk GenotypeGVCFs -R /mnt/gatk-bundle/2.8/b37/human_g1k_v37_decoy.fasta -V gendb://test_21 -O test_output.vcf -L 21

    Issue · Github, by Sheila: Issue Number 2970, State closed, Closed By chandrans
  • Yatros, Seattle, WA, USA (Member)

    Hello,

    Has anybody figured out a solution for this error? I'm facing a similar problem using GATK:4.0.1.0 docker image on a local server.

    I have tried changing the directory permissions to 755 and the file permissions to 655, but it does not make any difference. I get the following error:

    21:42:19.796 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.1.0
    21:42:19.796 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    21:42:19.797 INFO  GenotypeGVCFs - Executing as [email protected] on Linux v3.10.0-693.17.1.el7.x86_64 amd64
    21:42:19.797 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
    21:42:19.798 INFO  GenotypeGVCFs - Start Date/Time: February 26, 2018 9:42:18 PM UTC
    21:42:19.798 INFO  GenotypeGVCFs - ------------------------------------------------------------
    21:42:19.798 INFO  GenotypeGVCFs - ------------------------------------------------------------
    21:42:19.799 INFO  GenotypeGVCFs - HTSJDK Version: 2.14.1
    21:42:19.799 INFO  GenotypeGVCFs - Picard Version: 2.17.2
    21:42:19.799 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    21:42:19.799 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    21:42:19.799 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    21:42:19.800 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    21:42:19.800 INFO  GenotypeGVCFs - Deflater: IntelDeflater
    21:42:19.800 INFO  GenotypeGVCFs - Inflater: IntelInflater
    21:42:19.800 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
    21:42:19.800 INFO  GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    21:42:19.800 INFO  GenotypeGVCFs - Initializing engine
    21:42:20.878 INFO  FeatureManager - Using codec VCFCodec to read file file:///cromwell-executions/JointGenotyping/09ad8f69-8cc8-491b-92de-2b35cc5270f2/call-GenotypeGVCFs/shard-24/inputs/mnt/user/Test_goldstandard/dbsnp_138.b37.vcf
    terminate called after throwing an instance of 'VariantQueryProcessorException'
      what():  VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /cromwell-executions/JointGenotyping/09ad8f69-8cc8-491b-92de-2b35cc5270f2/call-GenotypeGVCFs/shard-24/execution/genomicsdb
    Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
    Running:
        /gatk/build/install/gatk/bin/gatk GenotypeGVCFs -R /cromwell-executions/JointGenotyping/09ad8f69-8cc8-491b-92de-2b35cc5270f2/call-GenotypeGVCFs/shard-24/inputs/mnt/user/Test_reference/human_g1k_v37_decoy.fasta -O output.vcf.gz -D /cromwell-executions/JointGenotyping/09ad8f69-8cc8-491b-92de-2b35cc5270f2/call-GenotypeGVCFs/shard-24/inputs/mnt/user/Test_goldstandard/dbsnp_138.b37.vcf -G StandardAnnotation --only-output-calls-starting-in-intervals -new-qual -V gendb://genomicsdb -L MT:1-16569
    

    Thanks,

    Yatros

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @Lavanya @Yatros
    Hi,

    Perhaps this issue will have some helpful hints. @Lavanya I am not sure, but perhaps this thread may help. It looks like Yatros is already using small intervals.

    Let me check with the team on other solutions and get back to you.

    -Sheila

  • I'm having the same problem using v4.0.0. A rerun can sometimes fix the problem. The reason remains unknown.

  • Yatros, Seattle, WA, USA (Member)

    Hi @Sheila ,

    I don't think the size of the intervals is an issue.

    I tried to combine 20 gVCF files from a sequencing panel of 48 genes, using one interval per sequenced exon. I had combined these same files successfully before with GATK4 beta.6. Using the same methodology and commands (GenomicsDBImport + GenotypeGVCFs), I can no longer combine these 20 samples with the GATK v4.0.1.0 / v4.0.1.2 releases.

    When I compare the output of both runs for the first interval of these 20 samples, these are the differences I can see regarding the files that are generated in each folder using each version:

    GATK4.beta.6 generated the following files in the genomicsdb folder:
    - __tiledb_workspace.tdb - 0 B
    - callset.json - 1,846 KB
    - vidmap.json - 8,264 KB

    GATK4.beta.6 generated the following files in the genomicsdb_array subfolder:
    - __array_schema.tdb - 631 B
    - genomicsdb_meta.json - 48 B

    GATK4.0.1.0 generates the following files in the genomicsdb folder:
    - __tiledb_workspace.tdb - 0 B
    - callset.json - 1,846 KB
    - vcfheader.vcf - 16,719 KB (Novel file)
    - vidmap.json - 9,132 KB

    GATK4.0.1.0 generates the following files in the genomicsdb_array subfolder:
    - .__consolidation_lock - 0 B (Novel file)
    - __array_schema.tdb - 587 B
    - genomicsdb_meta.json - 48 B

    It seems as if one of the newly generated files is not recognized correctly by the GenotypeGVCFs command in the current GATK4 version.
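A comparison like the one listed above can be reproduced mechanically for any two workspaces. The following is a minimal sketch using only the Python standard library; `list_relative` and `compare_workspaces` are hypothetical helper names, and the script only compares which files exist, not their contents or sizes.

```python
import os


def list_relative(workspace):
    """Return the set of file paths under workspace, relative to it."""
    found = set()
    for root, _dirs, files in os.walk(workspace):
        for name in files:
            full = os.path.join(root, name)
            found.add(os.path.relpath(full, workspace))
    return found


def compare_workspaces(old_ws, new_ws):
    """Return (only_in_old, only_in_new) as sorted lists of relative paths."""
    old_files = list_relative(old_ws)
    new_files = list_relative(new_ws)
    return sorted(old_files - new_files), sorted(new_files - old_files)


if __name__ == "__main__":
    import sys

    only_old, only_new = compare_workspaces(sys.argv[1], sys.argv[2])
    for path in only_old:
        print("only in first: ", path)
    for path in only_new:
        print("only in second:", path)
```

For the two listings above, this would be expected to report vcfheader.vcf and the .__consolidation_lock file as present only in the newer workspace.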

    One thing that caught my attention is that the generated vcfheader.vcf file does not contain any sample names. I'm not sure whether they should be added to the vcfheader file in this step or later on, when the actual genotypes are merged.

    I can send you the files from both runs if you need them.

    Thanks for your help,

    Best,

    Oswaldo

  • Lavanya (Member)
    edited March 1

    @Sheila
    Hi Sheila,
    Thanks for your response.
    The earlier testing failed on Lustre FS.
    When we try running "GenotypeGVCFs" on GPFS it seems to work.
    Could it be because of file locking mechanisms? Could you please confirm? Thanks.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @jianxinwang @Yatros @Lavanya
    Hi everyone,

    Thanks for the information. I will pass this on to the developers and get back to you soon.

    -Sheila

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @jianxinwang @Yatros @Lavanya @md1jale
    Hi again,

    While I am waiting to hear back from the developer, can someone confirm this happens with the very latest version of GATK4?

    Thanks,
    Sheila

  • Lavanya (Member)
    edited March 7

    @Sheila , I am testing with gatk-4.0.0.0 version

  • SkyWarrior, Turkey (Member)

    Can you test it with GATK 4.0.2.1 ?

  • Yatros, Seattle, WA, USA (Member)

    Hello,

    I have tested the GenotypeGVCFs command in the GATK 4.0.2.1 version with a single interval multiple times (each time using a different interval) and I get exactly the same error all the time:

    terminate called after throwing an instance of 'VariantQueryProcessorException'
      what():  VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /mnt/user/Project_DB
    Aborted (core dumped)
    

    Before the final release of the GATK 4.0.0.0, I was able to merge GenomicsDBImport output files with the GATK4.beta6 version, but I'm not able to output a single vcf file with the beta version anymore.

    I have been stuck at this step for several weeks now. Is there any way of getting around it?

    Thank you very much,

    Best,

    Yatros

  • Yatros, Seattle, WA, USA (Member)

    Hello,

    I don't know if this can help somebody. Just for your records.

    I was using the GATK4 Docker container on a CentOS 7.3 machine with an XFS filesystem, and I was getting the "Could not open array genomicsdb_array at workspace" error all the time. My OS volume was 22 GB, which is probably too small for Docker and its containers.

    Now I am using the GATK4 Docker container on an updated Ubuntu 16.04 machine with a standard server installation (not desktop) and an XFS filesystem. My main volume is 80 GB. The VM has 16 cores on a dual-CPU AMD 6276 system. The error does not appear anymore.

    The OS volume space may have been my main problem.
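To rule free space in or out on a given machine, the amount available on the relevant filesystems can be checked with the standard library. This is a minimal sketch; `free_gib` is a hypothetical helper name, and which paths matter (the Docker storage location, the workspace, the temporary directory) depends on the local setup and is not something this thread establishes.

```python
import shutil


def free_gib(path):
    """Return free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    import sys

    # Pass the directories of interest on the command line; default to
    # the current directory when none are given.
    for path in sys.argv[1:] or ["."]:
        print("%s: %.1f GiB free" % (path, free_gib(path)))
```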

    Best,

    Yatros

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @Yatros
    Hi Yatros,

    Thank you for sharing! @jianxinwang @Lavanya @md1jale Can you all see if this helps?

    Thanks,
    Sheila

  • md1jale (Member)

    I am running this on the cluster (CentOS Linux release 7.4.1708 (Core)) with:

    Filesystem                                   Size  Used  Avail  Use%  Mounted on
    [email protected]:[email protected]:/lustre  669T  133T  502T   21%   /mnt/fastdata

    This doesn't seem to be a lack-of-space issue.

  • shlee, Cambridge (Member, Broadie, Moderator)

    Hi @md1jale et al.,

    We're sorry you've had difficulties with using GenomicsDB. One thing that might help is our new gatk-workflows repository script that covers this step called gatk4-germline-snps-indels. This repository's WDL scripts outline the commands and version of Docker images (containing versions of programs, e.g. GATK4.0.0.0) that have been verified to work. It appears that there are two versions of scripts that cover GenomicsDB:

    The fc label means the script is for FireCloud. Note that the matched Docker versions are specified in the accompanying JSON input files.

  • cluengo (Member)

    Hi,

    I'm having the same issue when I run GenotypeGVCFs with the latest GATK version on a Lustre filesystem. The thing is that when I run it on a Lustre FS without file locking activated I get the error, but when I run it on other filesystems I don't. Is this a known issue? Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @cluengo
    Hi,

    I don't know if this is a known issue. I have not seen other reports of this. However, have a look at this thread. Perhaps some other users who use Lustre can jump in. Also, you wrote "when I run it on a Lustre FS without file locking activated I get the error". What do you mean by this? I saw we used to have a flag to stop file locking in GATK3, but it does not seem to exist in GATK4.

    Thanks,
    Sheila

  • cluengo (Member)

    Hi @Sheila,

    Thanks for the reply. What I meant is that the Lustre FS we use doesn't have file locking activated, so locking can't be done. I don't know if GATK4 always tries to lock files by default, but when I run the command writing to an NFS it does work, which is why we were suspicious of the file locking.

    Thanks,

    Cristina.
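Whether a given filesystem accepts file locks at all can be tested independently of GATK. The sketch below creates a scratch file in the target directory and tries to take and release a POSIX advisory lock on it. It is an assumption, not something confirmed in this thread, that GenomicsDB's consolidation lock uses the same mechanism; `can_lock` is a hypothetical helper name, the `fcntl` module is only available on Unix-like systems, and the directory must be writable for the scratch file to be created.

```python
import fcntl
import os
import tempfile


def can_lock(directory):
    """Try to take and release an advisory lock on a scratch file.

    Returns (True, "") on success, or (False, reason) when locking fails.
    """
    fd, path = tempfile.mkstemp(prefix=".locktest_", dir=directory)
    try:
        try:
            # Exclusive, non-blocking POSIX lock over the whole file.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fd, fcntl.LOCK_UN)
            return True, ""
        except OSError as exc:
            return False, str(exc)
    finally:
        # Always clean up the scratch file, whatever the outcome.
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    import sys

    ok, reason = can_lock(sys.argv[1])
    print("locking works" if ok else "locking failed: " + reason)
```

Running it once against the Lustre mount and once against a filesystem where GenotypeGVCFs succeeds would show whether locking is the difference between the two.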

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @cluengo
    Hi Cristina,

    I am checking with the team if there is anything they can help with. I am hoping someone else who uses Lustre can jump in as well.

    -Sheila

  • LouisB, Broad Institute (Member, Broadie, Dev)

    Hi @cluengo and others,

    Sorry you're having problems with this. It sounds like this thread is probably describing several underlying issues, all of which present with very vague and insufficient error messages.

    I've opened a bug report to track this and asked for help from the GenomicsDB developers who will have much more insight into this.

    Hopefully we can resolve the problems and allow it to run on a Lustre system, although I don't know if we have access to one, so testing it may be tricky. If for some reason file locking is absolutely required to run GenomicsDB, then at a minimum we should improve the error messages so it's clear what's wrong.
    Louis

  • Hi, I got this error with version 4.0.0.0 and updated to the newest version (4.0.8.1), but that did not solve the problem. Any ideas what might help at this stage?

    gatk GenotypeGVCFs -R /home/mass/GRD/ndsaraujo/damona_p/nobackup/working_GC/bosTau6.fasta -V gendb://genomicsdb_chr1_2_2018-09-03 --output chr1_MainPed_jcall_2018-09-07.vcf

    Using GATK jar /home/mass/ifilesets/ULG/u227610/programs/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/mass/ifilesets/ULG/u227610/programs/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar GenotypeGVCFs -R /home/mass/GRD/ndsaraujo/damona_p/nobackup/working_GC/bosTau6.fasta -V gendb://genomicsdb_chr1_2_2018-09-03 --output chr1_MainPed_jcall_2018-09-07.vcf
    12:31:49.395 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/mass/ifilesets/ULG/u227610/programs/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    12:31:49.537 INFO GenotypeGVCFs - ------------------------------------------------------------
    12:31:49.537 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.8.1
    12:31:49.537 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    12:31:49.537 INFO GenotypeGVCFs - Executing as [email protected] on Linux v3.10.0-327.4.5.el7.x86_64 amd64
    12:31:49.537 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
    12:31:49.538 INFO GenotypeGVCFs - Start Date/Time: September 12, 2018 12:31:49 PM CEST
    12:31:49.538 INFO GenotypeGVCFs - ------------------------------------------------------------
    12:31:49.538 INFO GenotypeGVCFs - ------------------------------------------------------------
    12:31:49.538 INFO GenotypeGVCFs - HTSJDK Version: 2.16.0
    12:31:49.538 INFO GenotypeGVCFs - Picard Version: 2.18.7
    12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:31:49.538 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:31:49.538 INFO GenotypeGVCFs - Deflater: IntelDeflater
    12:31:49.538 INFO GenotypeGVCFs - Inflater: IntelInflater
    12:31:49.539 INFO GenotypeGVCFs - GCS max retries/reopens: 20
    12:31:49.539 INFO GenotypeGVCFs - Using google-cloud-java fork https://github.com/broadinstitute/google-cloud-java/releases/tag/0.20.5-alpha-GCS-RETRY-FIX
    12:31:49.539 INFO GenotypeGVCFs - Initializing engine
    terminate called after throwing an instance of 'VariantQueryProcessorException'
    what(): VariantQueryProcessorException : Could not open array chr1$1$158337067 at workspace: /home/mass/ifilesets/URT/UGU_DAM/nobackup/working_GC/GVCFS/genomicsdb_chr1_2_2018-09-03
    TileDB error message : [TileDB::StorageManager] Error: Cannot lock consolidation filelock; Cannot lock

    I've noticed that GenomicsDBImport generates different files for the different versions:

    • version 4.0.0.0:
      __tiledb_workspace.tdb

    • version 4.0.8.1:
      callset.json chr1$1$158337067 __tiledb_workspace.tdb vcfheader.vcf vidmap.json

    This slightly changes the error message (the following is for version 4.0.0.0):

    A USER ERROR has occurred: Couldn't connect to GenomicsDB because the vidmap, callset JSON files, or gVCF Header (vidmap.json,callset.json,vcfheader.vcf) could not be read from GenomicsDB workspace /home/mass/ifilesets/URT/UGU_DAM/nobackup/working_GC/GVCFS/genomicsdb_chr1_2018-09-03

    But I'm still not able to run the GenotypeGVCFs.
