GenomicsDBImport error

I'm trying to test GenomicsDBImport with a small set of gVCF samples (gatk-4.0.10.1). Here is the command I ran:

gatk GenomicsDBImport \
-V 1.gvcf.gz \
-V 2.gvcf.gz \
-V 3.gvcf.gz \
-V 4.gvcf.gz \
-V 5.gvcf.gz \
-V 6.gvcf.gz \
--genomicsdb-workspace-path test_database \
--intervals 8:41000000-42000000

I got a bunch of error messages and eventually tracked the problem down to these lines:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002aab94159809, pid=3156, tid=0x00002aab5bae2700
#
# JRE version: OpenJDK Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: OpenJDK 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtiledbgenomicsdb6069813449664720959.so+0x159809]  BufferVariantCell::set_cell(void const*)+0x99
#
# Core dump written. Default location: /my_directory/core or core.3156

This looks similar to the issue reported here:

GenomicsDBImport: A fatal error has been detected by the Java Runtime Environment #5045
https://github.com/broadinstitute/gatk/issues/5045#issuecomment-407476343

Following discussions in that thread, I ran "vcf_validator" for my gVCF files and got this (for 1.gvcf.gz):

Error: ALT metadata ID does not begin with DEL/INS/DUP/INV/CNV. This occurs 1 time(s), first time in line 2.
Error: Format is not a colon-separated list of alphanumeric strings. This occurs 93311797 time(s), first time in line 631.
Error: Alternate ID is not prefixed by DEL/INS/DUP/INV/CNV and suffixed by ':' and a text sequence. This occurs 20022413 time(s), first time in line 632.
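
For reference, one way to run such a check (a minimal sketch assuming the EBI vcf-validator, which reads a VCF from standard input; the exact invocation used in the original post is not shown):

# stream the decompressed gVCF through the validator
zcat 1.gvcf.gz | vcf_validator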

At this point, I'm not sure whether these gVCF errors are causing the crash. Some of my gVCF files were not produced by GATK HaplotypeCaller, so that might be a factor, too.

If anyone could offer some advice on this problem, that would be great!

Thanks!

Ke


Answers

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)

    Hi @biojiangke

    There's a chance you just ran out of memory. The tool docs suggest running with Java options that set the amount of memory used to a few GB less than the physical memory of the machine (because the tool calls an external library which also uses memory). E.g. if your system has 8 GB of memory and 1 GB is already used by the system, you'd probably want GATK to use only 4 GB (which leaves around 3 GB for the external library).
    The syntax to call GenomicsDBImport with this restriction is:
    gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport [arguments to GenomicsDBImport]

    Obviously, if you have more than 8 GB of memory you'd want to allocate more than that to GATK (see the sketch at the end of this reply).
    I reached out to the author who said that this is "worth a shot" but that the process may need more than 4GB (i.e. it may need a pretty beefy machine to run on).

    I hope this helps.

    Regards
    Bhanu
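
    A concrete version of this suggestion, as a minimal sketch using the file names and interval from the original post:

    # cap the JVM heap at 4 GB, leaving headroom for the native GenomicsDB/TileDB library
    gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
    -V 1.gvcf.gz \
    -V 2.gvcf.gz \
    -V 3.gvcf.gz \
    -V 4.gvcf.gz \
    -V 5.gvcf.gz \
    -V 6.gvcf.gz \
    --genomicsdb-workspace-path test_database \
    --intervals 8:41000000-42000000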

  • biojiangke (Member ✭✭)

    Thanks for the quick response. With -Xms48g -Xmx48g, I still got the same error. Considering there are only six gVCFs of about 2 GB each and a relatively small slice of the genome (1 Mbp), I would think 48 GB of memory should be enough?

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)

    Hi @biojiangke

    Would you send me the specs of the computer you are running this on? Thank you.

    Regards
    Bhanu

  • biojiangke (Member ✭✭)

    I'm running on a node in a cluster:

    2x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
    128GB Memory
    CentOS 7.5
    OpenJDK = 1.8.0.181-3.b13.el7_5
    OracleJDK = 8u181

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)
    edited October 2018

    Hi @biojiangke

    Can you post the error log you are getting with the -Xmx option? I am having the developers look into this. The more information you can provide, the faster we will be able to find a solution.

    Regards
    Bhanu

  • biojiangke (Member ✭✭)

    I've attached the error log.

  • biojiangke (Member ✭✭)

    Thanks for the advice. I've been using CombineGVCFs for a while and it has been working well for us. Facing an ever-increasing amount of WGS data, I'd like to try GenomicsDBImport to gain some performance. I'll give ValidateVariants a try and report back if I find anything interesting.
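
    For reference, a minimal sketch of checking a single file with ValidateVariants in GVCF mode (the reference path here is a placeholder, not from this thread):

    # confirm the file is a proper gVCF with reference blocks covering every locus
    gatk ValidateVariants \
    -R reference.fasta \
    -V 1.gvcf.gz \
    --validate-GVCF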

  • biojiangke (Member ✭✭)

    Running the validation on just one gVCF file, I got the following error message:

    A USER ERROR has occurred: A GVCF must cover the entire region. Found 54134140 loci with no VariantContext covering it. The first uncovered segment is:1:20000

    I searched around for this message but couldn't find much informative advice on it. Is this something inherent to the gVCF calling process used, meaning the gVCF is not up to the standard required by GenomicsDBImport?

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)

    Hi @biojiangke

    By definition gVCFs have records for each locus in the genome, even if there is no variant there (large runs with no variant can be represented by a "block" of homozygous-reference). See https://software.broadinstitute.org/gatk/documentation/article.php?id=4017

    Your VCFs are not covering the whole genome: most likely they're actually regular VCFs and thus not suitable for pooling using GenomicsDBImport. Probably the best thing is to generate new gVCFs using HaplotypeCaller.

    Follow this link for more info: What is a GVCF and how is it different from a 'regular' VCF?

    Hope this helps.

    Regards
    Bhanu

  • biojiangke (Member ✭✭)

    Mmmm... interesting. I'm pretty sure it is a gVCF rather than a regular VCF, as the first few lines show here:

    1 1 . G <NON_REF> . . END=369 GT:DP:GQ:MIN_DP:PL 0/0:65:99:54:0,120,1800
    1 370 . G A,<NON_REF> 389.77 . DP=61;MQ=60;MQRankSum=2.068;ReadPosRankSum=-0.863;FractionInformativeReads=1 GT:AD:DP:GQ:PL:SB 0/1:46,15,0:61:99:418,0,1738,2339,1783,2339:20,26,5,10
    1 371 . G <NON_REF> . . END=453 GT:DP:GQ:MIN_DP:PL 0/0:69:87:60:0,87,1305
    1 454 . C <NON_REF> . . END=461 GT:DP:GQ:MIN_DP:PL 0/0:60:60:59:0,60,900
    1 462 . A <NON_REF> . . END=468 GT:DP:GQ:MIN_DP:PL 0/0:60:45:59:0,45,675
    1 469 . C CT,CTT,<NON_REF> 379.73 . DP=41;MQ=60;MQRankSum=0.146;ReadPosRankSum=-0.183;FractionInformativeReads=0.878 GT:AD:DP:GQ:PL:SB 0/2:9,5,22,0:36:99:673,417,551,0,119,229,674,566,228,674:2,7,10,17

    It was generated by an older version of GATK, though. I'm not sure whether there is something making it incompatible with GenomicsDBImport.

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)

    Hi @biojiangke

    Which version of GATK did you use to generate these gVCFs? Can you send me the header of the gVCF file?
    Also, is it possible for you to regenerate the gVCF with the latest version?

    Regards
    Bhanu

  • biojiangke (Member ✭✭)

    This was done using the DRAGEN germline pipeline.

    http://edicogenome.com/pipelines/dragen-germline-v2-pipeline-2/

    Unfortunately they would not share details about their variant caller, but I believe they used a certain version of GATK.

    Re-generating the gVCFs with the current HaplotypeCaller is possible, but with the number of individuals and samples we have, it would be a very big and expensive endeavor.

    Thanks for all the help! I guess I'll live with CombineGVCFs for now.
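
    For completeness, a minimal sketch of the CombineGVCFs route mentioned above (the reference and output paths are placeholders, not from this thread):

    # merge per-sample gVCFs into one multi-sample gVCF ahead of GenotypeGVCFs
    gatk CombineGVCFs \
    -R reference.fasta \
    -V 1.gvcf.gz \
    -V 2.gvcf.gz \
    -V 3.gvcf.gz \
    -O combined.g.vcf.gz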

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)

    @biojiangke

    The issue definitely looks like it has something to do with the way the gVCF was created. GenomicsDBImport is very picky about its gVCFs.

  • SkyWarrior (Turkey; Member ✭✭✭)

    If gVCFs are not created using the g.vcf.gz extension, the reference-confidence emission mode is not on, so you don't get a proper gVCF. Can you try using the filename.g.vcf.gz pattern when creating your gVCFs?

  • biojiangkebiojiangke Member ✭✭

    @SkyWarrior Are you talking about this option in HaplotypeCaller?

    --emit-ref-confidence,-ERC:ReferenceConfidenceMode

    The answer is I don't know whether the original gVCF was called with this option, but I'll give it a try. I think I have used this option before as -ERC:GVCF, but never used the resulting gVCF file for GenomicsDBImport.

    Thank you for your tip! Always happy to learn from everyone.
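
    For reference, a minimal sketch of running HaplotypeCaller in GVCF mode (the reference and BAM paths are placeholders, not from this thread):

    # -ERC GVCF turns on reference-confidence emission, producing a true gVCF
    gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF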
