GenomicsDBImport error

I'm trying to test GenomicsDBImport with a small set of gVCF samples (gatk-4.0.10.1). Here is the command I ran:
gatk GenomicsDBImport \
-V 1.gvcf.gz \
-V 2.gvcf.gz \
-V 3.gvcf.gz \
-V 4.gvcf.gz \
-V 5.gvcf.gz \
-V 6.gvcf.gz \
--genomicsdb-workspace-path test_database \
--intervals 8:41000000-42000000
And I got a bunch of error messages, which I eventually tracked down to these lines:
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002aab94159809, pid=3156, tid=0x00002aab5bae2700
#
# JRE version: OpenJDK Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: OpenJDK 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtiledbgenomicsdb6069813449664720959.so+0x159809]  BufferVariantCell::set_cell(void const*)+0x99
#
# Core dump written. Default location: /my_directory/core or core.3156
This looks similar to the issue reported here:
GenomicsDBImport: A fatal error has been detected by the Java Runtime Environment #5045
https://github.com/broadinstitute/gatk/issues/5045#issuecomment-407476343
Following discussions in that thread, I ran "vcf_validator" for my gVCF files and got this (for 1.gvcf.gz):
Error: ALT metadata ID does not begin with DEL/INS/DUP/INV/CNV. This occurs 1 time(s), first time in line 2.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 93311797 time(s), first time in line 631.
Error: Alternate ID is not prefixed by DEL/INS/DUP/INV/CNV and suffixed by ':' and a text sequence. This occurs 20022413 time(s), first time in line 632.
At this point, I'm not sure whether these gVCF errors are causing the crash. Some of my gVCF files were not produced by GATK HaplotypeCaller, so that might be a factor, too.
If anyone could offer some advice on this problem, that would be great!
Thanks!
Ke
Best Answer
-
bhanuGandham Cambridge MA admin
Hi @biojiangke
Here are two options you can try:
1) Try to validate your gVCFs by running ValidateVariants. It looks like they'd need to be either re-generated or repaired. Without looking at the gVCFs it's hard to tell how feasible it is to repair them. The Intel library TileDB (which is used by GenomicsDBImport) probably assumes the gVCFs are fully in spec, and it could be making assumptions that cause strange crashes.
2) It may be worth trying CombineGVCFs instead of GenomicsDBImport, which may be less finicky about out-of-spec gVCFs. CombineGVCFs is also the recommended tool for smaller sample sets such as yours.
Let me know if this helps.
Regards
Bhanu
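For reference, a ValidateVariants run along those lines might look like the following. This is a sketch with placeholder file names (reference.fasta, 1.gvcf.gz), and the --validate-GVCF flag is assumed from GATK4's ValidateVariants; check gatk ValidateVariants --help on your version:

```shell
# Sketch only: file names are placeholders, and --validate-GVCF is the
# assumed GATK4 flag that turns on the gVCF-aware coverage checks.
if command -v gatk >/dev/null 2>&1; then
  gatk ValidateVariants \
      -R reference.fasta \
      -V 1.gvcf.gz \
      --validate-GVCF \
    || echo "NOTE: ValidateVariants reported problems or inputs are missing"
else
  echo "NOTE: gatk not on PATH; command shown for reference only"
fi
```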
Answers
Hi @biojiangke
There's a chance you just ran out of memory. The tool docs suggest setting the Java heap a few GB below the machine's physical memory, because the tool calls an external native library that also uses memory. For example, if your system has 8 GB of memory and 1 GB is already used by the OS, you'd probably want GATK to use only 4 GB (leaving around 3 GB for the external library).
The syntax to call GenomicsDBImport with this restriction is:
gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport [arguments to GenomicsDBImport]
Obviously, if you have more than 8 GB of memory, you'd want to allocate more than that to GATK.
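That sizing rule can be sketched as a small shell calculation. The 4 GB of native-library headroom here is illustrative, not an official figure:

```shell
# Pick a JVM heap a few GB below physical RAM so the native
# GenomicsDB/TileDB library has headroom (headroom value is illustrative).
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  total_gb=$(( total_kb / 1024 / 1024 ))
else
  total_gb=8   # fallback when /proc/meminfo is unavailable
fi
if [ "$total_gb" -gt 8 ]; then
  heap_gb=$(( total_gb - 4 ))     # leave ~4 GB for the native library
else
  heap_gb=$(( total_gb / 2 ))     # on small machines, split roughly in half
fi
echo "try: gatk --java-options \"-Xmx${heap_gb}g -Xms${heap_gb}g\" GenomicsDBImport ..."
```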
I reached out to the author who said that this is "worth a shot" but that the process may need more than 4GB (i.e. it may need a pretty beefy machine to run on).
I hope this helps.
Regards
Bhanu
Thanks for the quick response. With -Xms48g -Xmx48g, I still got the same error. Considering there are only six gVCFs of about 2 GB each and a relatively small slice of the genome (1 Mbp), I'd think 48 GB of memory should be enough?
Hi @biojiangke
Could you send me the specs of the computer you are running this on? Thank you.
Regards
Bhanu
I'm running on a node in a cluster
2x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
128GB Memory
CentOS 7.5
OpenJDK = 1.8.0.181-3.b13.el7_5
OracleJDK = 8u181
Hi @biojiangke
Can you post the error log you are getting with the -Xmx option? I am having the developers look into this. The more information you can provide, the faster we will be able to find a solution.
Regards
Bhanu
Attached is the error log.
Thanks for the advice. I've been using CombineGVCFs for a while and it has been working well for us, but facing an ever-increasing amount of WGS data, I'd like to try GenomicsDBImport to gain some performance. I'll give ValidateVariants a try and report back if I find anything interesting.
Running the validation on just one gVCF file, I got the following error message:
A USER ERROR has occurred: A GVCF must cover the entire region. Found 54134140 loci with no VariantContext covering it. The first uncovered segment is:1:20000
I searched around for this message and haven't found much informative advice. Is this something inherent to the gVCF calling process that was used, meaning the gVCF doesn't meet the standard required by GenomicsDBImport?
Hi @biojiangke
By definition, gVCFs have records for every locus in the genome, even where there is no variant (long runs without variants can be represented by a "block" of homozygous-reference calls). See https://software.broadinstitute.org/gatk/documentation/article.php?id=4017
Your VCFs do not cover the whole genome, so most likely they're actually regular VCFs and thus not suitable for pooling with GenomicsDBImport. Probably the best thing is to generate new gVCFs using HaplotypeCaller.
Follow this link for more info: What is a GVCF and how is it different from a 'regular' VCF?
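As a quick illustration of that "blocks must tile the genome" property, here is a toy contiguity check: each record should start at the previous record's END (or POS) + 1. This is a sketch on three inline space-separated records, not a real validator (actual VCFs are tab-separated):

```shell
# Toy gVCF-style records: ref blocks carry END= in the INFO column ($8).
# Any jump between records means an uncovered segment, which is what
# GenomicsDBImport and ValidateVariants complain about.
cat > toy.g.vcf <<'EOF'
1 1 . G <NON_REF> . . END=369
1 370 . G A,<NON_REF> 389.77 . DP=61
1 371 . G <NON_REF> . . END=453
EOF
gaps=$(awk '!/^#/ {
  end = $2 + 0                        # default for single-position records
  if (match($8, /END=[0-9]+/))        # reference blocks span up to END
    end = substr($8, RSTART + 4, RLENGTH - 4) + 0
  if (NR > 1 && $2 != prev + 1) n++   # record does not abut the previous one
  prev = end
} END { print n + 0 }' toy.g.vcf)
echo "uncovered segments: $gaps"
```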
Hope this helps.
Regards
Bhanu
Mmmm... interesting. I'm pretty sure it is a gVCF rather than a regular VCF, as the first few lines show:
1 370 . G A,<NON_REF> 389.77 . DP=61;MQ=60;MQRankSum=2.068;ReadPosRankSum=-0.863;FractionInformativeReads=1 GT:AD:DP:GQ:PL:SB 0/1:46,15,0:61:99:418,0,1738,2339,1783,2339:20,26,5,10
1 371 . G <NON_REF> . . END=453 GT:DP:GQ:MIN_DP:PL 0/0:69:87:60:0,87,1305
1 454 . C <NON_REF> . . END=461 GT:DP:GQ:MIN_DP:PL 0/0:60:60:59:0,60,900
1 462 . A <NON_REF> . . END=468 GT:DP:GQ:MIN_DP:PL 0/0:60:45:59:0,45,675
1 469 . C CT,CTT,<NON_REF> 379.73 . DP=41;MQ=60;MQRankSum=0.146;ReadPosRankSum=-0.183;FractionInformativeReads=0.878 GT:AD:DP:GQ:PL:SB 0/2:9,5,22,0:36:99:673,417,551,0,119,229,674,566,228,674:2,7,10,17
It was generated by older versions of GATK, though. I'm not sure whether there is something making it incompatible with GenomicsDBImport.
Hi @biojiangke
Which version of gatk did you use to generate these gvcfs? Can you send me the header of the gVCF file?
Also, is it possible for you to regenerate the gVCF with the latest version?
Regards
Bhanu
This was done using the DRAGEN germline pipeline:
http://edicogenome.com/pipelines/dragen-germline-v2-pipeline-2/
Unfortunately they would not share details about their variant caller, but I believe they used a certain version of GATK.
Re-generating the gVCFs with the current HaplotypeCaller is possible, but with the number of individuals and samples we have, it would be a very big and expensive endeavor.
Thanks for all the help! I guess I'll live with CombineGVCFs for now.
@biojiangke
The issue definitely looks like something to do with the way the gVCF was created, and GenomicsDBImport is very picky about its gVCFs.
If gVCFs are not created using the g.vcf.gz extension, reference-confidence emission mode is not on, and therefore you don't get a proper gVCF. Can you try using the filename.g.vcf.gz pattern when creating your gVCFs?
@SkyWarrior Are you talking about this option in HaplotypeCaller?
--emit-ref-confidence,-ERC:ReferenceConfidenceMode
The answer is I don't know whether the original gVCF was called with this option, but I'll give it a try. I think I have used this option before as -ERC GVCF, but never used the resulting gVCF file for GenomicsDBImport.
Thank you for your tip! Always happy to learn from everyone.
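For anyone landing here later, a reference-confidence HaplotypeCaller run along the lines discussed above might look like the following. This is a sketch with placeholder file names; -ERC GVCF is the flag that makes the output a true gVCF with reference blocks:

```shell
# Sketch only: reference.fasta and sample.bam are placeholders.
if command -v gatk >/dev/null 2>&1; then
  gatk HaplotypeCaller \
      -R reference.fasta \
      -I sample.bam \
      -O sample.g.vcf.gz \
      -ERC GVCF \
    || echo "NOTE: HaplotypeCaller failed (placeholder inputs)"
else
  echo "NOTE: gatk not on PATH; command shown for reference only"
fi
```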