GenotypeGVCFs multiple gvcf bug

Hi all - I've tried using GenotypeGVCFs in both 3.7 and 4.0, but I get different errors in each. For 3.7 (nightly-2017-05-17-g44b6fa2), I get:
java -Xmx124g -jar ~/bin/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/elab/references/mus_musculus/Mus_musculus.GRCm38.dna_sm.fa -nt 19 --max_alternate_alleles 4 --variant Sample_147003/Sample_147003.g.vcf --variant Sample_148112/Sample_148112.g.vcf --variant Sample_151203/Sample_151203.g.vcf --variant Sample_206082/Sample_206082.g.vcf --variant Sample_206083/Sample_206083.g.vcf --variant Sample_212034/Sample_212034.g.vcf --variant Sample_213965/Sample_213965.g.vcf --variant Sample_214051/Sample_214051.g.vcf --variant sample_14831_3/sample_14831_3.g.vcf --variant sample_14851_3/sample_14851_3.g.vcf --variant sample_15183_5/sample_15183_5.g.vcf --variant sample_15639_2/sample_15639_2.g.vcf --variant sample_15640_3/sample_15640_3.g.vcf --variant sample_15675_4/sample_15675_4.g.vcf --variant sample_15708_2/sample_15708_2.g.vcf --variant sample_15714_3/sample_15714_3.g.vcf --variant sample_15744_2/sample_15744_2.g.vcf --variant sample_15746_3/sample_15746_3.g.vcf --variant sample_15766_3/sample_15766_3.g.vcf --variant sample_15791_4/sample_15791_4.g.vcf --variant sample_15811_3/sample_15811_3.g.vcf --variant sample_15822_3/sample_15822_3.g.vcf --variant sample_15836_2/sample_15836_2.g.vcf --variant sample_15836_3/sample_15836_3.g.vcf --variant sample_15858_5/sample_15858_5.g.vcf --variant sample_15871_3/sample_15871_3.g.vcf --variant sample_15875_2/sample_15875_2.g.vcf --variant sample_15876_3/sample_15876_3.g.vcf --variant sample_15876_5/sample_15876_5.g.vcf --variant sample_15914_2/sample_15914_2.g.vcf --variant sample_15915_3/sample_15915_3.g.vcf --variant sample_20544_1/sample_20544_1.g.vcf --variant sample_20549_5/sample_20549_5.g.vcf --variant sample_20551_1/sample_20551_1.g.vcf --variant sample_20551_3/sample_20551_3.g.vcf --variant sample_20605_2/sample_20605_2.g.vcf --variant sample_20605_4/sample_20605_4.g.vcf --variant sample_20605_5/sample_20605_5.g.vcf --variant sample_20836_5/sample_20836_5.g.vcf --variant sample_20952_2/sample_20952_2.g.vcf -o /media/elab/Seagate_Expansion_Drive_1/ms_exomes/final/all.combined.vcf
INFO 09:32:54,680 HelpFormatter - ---------------------------------------------------------------------------------------------
INFO 09:32:54,683 HelpFormatter - The Genome Analysis Toolkit (GATK) vnightly-2017-05-17-g44b6fa2, Compiled 2017/05/17 00:01:17
INFO 09:32:54,683 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 09:32:54,683 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 09:32:54,683 HelpFormatter - [Thu Jul 27 09:32:54 CDT 2017] Executing on Linux 4.4.0-87-generic amd64
INFO 09:32:54,684 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-2ubuntu1.16.04.2-b11
INFO 09:32:54,686 HelpFormatter - Program Args: -T GenotypeGVCFs -R /home/elab/references/mus_musculus/Mus_musculus.GRCm38.dna_sm.fa -nt 19 --max_alternate_alleles 4 --variant Sample_147003/Sample_147003.g.vcf --variant Sample_148112/Sample_148112.g.vcf --variant Sample_151203/Sample_151203.g.vcf --variant Sample_206082/Sample_206082.g.vcf --variant Sample_206083/Sample_206083.g.vcf --variant Sample_212034/Sample_212034.g.vcf --variant Sample_213965/Sample_213965.g.vcf --variant Sample_214051/Sample_214051.g.vcf --variant sample_14831_3/sample_14831_3.g.vcf --variant sample_14851_3/sample_14851_3.g.vcf --variant sample_15183_5/sample_15183_5.g.vcf --variant sample_15639_2/sample_15639_2.g.vcf --variant sample_15640_3/sample_15640_3.g.vcf --variant sample_15675_4/sample_15675_4.g.vcf --variant sample_15708_2/sample_15708_2.g.vcf --variant sample_15714_3/sample_15714_3.g.vcf --variant sample_15744_2/sample_15744_2.g.vcf --variant sample_15746_3/sample_15746_3.g.vcf --variant sample_15766_3/sample_15766_3.g.vcf --variant sample_15791_4/sample_15791_4.g.vcf --variant sample_15811_3/sample_15811_3.g.vcf --variant sample_15822_3/sample_15822_3.g.vcf --variant sample_15836_2/sample_15836_2.g.vcf --variant sample_15836_3/sample_15836_3.g.vcf --variant sample_15858_5/sample_15858_5.g.vcf --variant sample_15871_3/sample_15871_3.g.vcf --variant sample_15875_2/sample_15875_2.g.vcf --variant sample_15876_3/sample_15876_3.g.vcf --variant sample_15876_5/sample_15876_5.g.vcf --variant sample_15914_2/sample_15914_2.g.vcf --variant sample_15915_3/sample_15915_3.g.vcf --variant sample_20544_1/sample_20544_1.g.vcf --variant sample_20549_5/sample_20549_5.g.vcf --variant sample_20551_1/sample_20551_1.g.vcf --variant sample_20551_3/sample_20551_3.g.vcf --variant sample_20605_2/sample_20605_2.g.vcf --variant sample_20605_4/sample_20605_4.g.vcf --variant sample_20605_5/sample_20605_5.g.vcf --variant sample_20836_5/sample_20836_5.g.vcf --variant sample_20952_2/sample_20952_2.g.vcf -o /media/elab/Seagate_Expansion_Drive_1/ms_exomes/final/all.combined.vcf
INFO 09:32:54,689 HelpFormatter - Executing as [email protected] on Linux 4.4.0-87-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-2ubuntu1.16.04.2-b11.
INFO 09:32:54,689 HelpFormatter - Date/Time: 2017/07/27 09:32:54
INFO 09:32:54,690 HelpFormatter - ---------------------------------------------------------------------------------------------
INFO 09:32:54,690 HelpFormatter - ---------------------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/elab/bin/gatk/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 09:32:54,857 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 09:32:54,857 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 09:32:54,858 GenomeAnalysisEngine - Strictness is SILENT
INFO 09:32:54,974 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
##### ERROR --
##### ERROR stack trace
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:187)
    at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:165)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:375)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:359)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:319)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:264)
    at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:153)
    at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.<init>(ReferenceOrderedDataSource.java:208)
    at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.<init>(ReferenceOrderedDataSource.java:88)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1074)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:851)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:294)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:181)
    ... 15 more
Caused by: java.io.EOFException
    at htsjdk.tribble.util.LittleEndianInputStream.readFully(LittleEndianInputStream.java:138)
    at htsjdk.tribble.util.LittleEndianInputStream.readLong(LittleEndianInputStream.java:80)
    at htsjdk.tribble.index.interval.IntervalTreeIndex$ChrIndex.read(IntervalTreeIndex.java:203)
    at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:367)
    at htsjdk.tribble.index.interval.IntervalTreeIndex.<init>(IntervalTreeIndex.java:52)
    ... 20 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version nightly-2017-05-17-g44b6fa2):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: java.lang.reflect.InvocationTargetException
##### ERROR ------------------------------------------------------------------------------------------
Notably, however, if I feed in only a smaller subset of the gvcfs, I don't get this error.
In 4.0 on the other hand, it doesn't look like it can handle multiple gvcf inputs:
***********************************************************************
A USER ERROR has occurred: Argument '[V, variant]' cannot be specified more than once.
***********************************************************************
Is this just changed syntax between 3.7 and 4.0, or can 4.0 genuinely not perform joint genotyping on multiple gvcfs?
Thanks,
Scott
Answers
Hi @scottyler89,
Maybe I can help with the GATK4 issue. If I remember right from the GATK workshop (@Geraldine_VdAuwera), they changed the approach to genotyping GVCFs in GATK4. So you first use MergeVcfs (a Picard tool) to merge your GVCFs, and then run GenotypeGVCFs on the resulting gvcf.
An example of how to run MergeVcfs:
Shamelessly stolen from the official pipeline
Maybe the tool description should be altered:
"Perform joint genotyping on one or more samples pre-called with HaplotypeCaller"
I think it is a leftover from version 3.7.
Hope this helps...
Greetings EADG
Thanks for the advice. Unfortunately, I ran into difficulties and bugs when I gave that a try as well. It looks like CombineGVCFs has been deprecated in GATK4? I tried using it in 3.7, and got the following bug:
When I tried using Picard, I got this error:
I'm assuming this is because the output from HaplotypeCaller (at least with the parameters I used) didn't yield base-level calls. I tried the 3.7 CombineGVCFs again with the --convertToBasePairResolution flag, but got the same error as when calling it without the flag.
Thanks again for your help!
@scottyler89 You understood correctly that in GATK4, GenotypeGVCFs only takes a single input. And indeed, CombineGVCFs is gone, because it was a horribly inefficient tool. Instead, we have a tool called GenomicsDBImport that takes in all your GVCFs and produces a database (really a directory with a bunch of files) that you can then provide as input to GenotypeGVCFs. See this document: https://software.broadinstitute.org/gatk/documentation/article?id=10061
The error you got with 3.7 reminds me of a bug that was fixed a little while ago so you might have better luck with a more recent nightly build, or the 3.8 version which I'm trying to get out right now (having a few technical difficulties but it might be ready by the time you read this). But the GATK4 version is better anyway so if you're willing to upgrade while it's still in beta, I would recommend using that.
By the way, what @EADG described is the procedure for merging GVCFs produced from the same sample by scatter over genomic intervals. It will not work to prepare multiple sample GVCFs for input to GenotypeGVCFs.
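For anyone landing on this thread later, the two-step GATK4 workflow described above (GenomicsDBImport, then GenotypeGVCFs on the resulting workspace) can be sketched roughly as below. This is only a small Python helper that assembles and prints the two command lines; it does not run GATK. The flag spellings (`--genomicsdb-workspace-path`, `-L`, the `gendb://` prefix) are the ones used in the GATK 4.x releases and may differ in the beta discussed here, so check each tool's `--help` output first. The workspace name `ms_exomes_db` and the interval `1` are hypothetical placeholders; the sample paths, reference and output file come from the original post.

```python
import shlex
from typing import List, Sequence


def genomicsdb_import_cmd(gvcfs: Sequence[str], workspace: str, interval: str) -> List[str]:
    """Build a GenomicsDBImport call with one -V per per-sample GVCF."""
    if not gvcfs:
        raise ValueError("need at least one GVCF")
    cmd = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    # Assumption: GATK 4.x release flag names; the 4.0 beta may spell these differently.
    cmd += ["--genomicsdb-workspace-path", workspace, "-L", interval]
    return cmd


def genotype_gvcfs_cmd(reference: str, workspace: str, output_vcf: str) -> List[str]:
    """Build a GenotypeGVCFs call that takes the workspace as its single -V input."""
    return ["gatk", "GenotypeGVCFs", "-R", reference,
            "-V", "gendb://" + workspace, "-O", output_vcf]


if __name__ == "__main__":
    # Two of the GVCFs from the original post; list all of them in practice.
    gvcfs = ["Sample_147003/Sample_147003.g.vcf", "Sample_148112/Sample_148112.g.vcf"]
    reference = "/home/elab/references/mus_musculus/Mus_musculus.GRCm38.dna_sm.fa"
    output_vcf = "/media/elab/Seagate_Expansion_Drive_1/ms_exomes/final/all.combined.vcf"
    workspace = "ms_exomes_db"  # hypothetical workspace directory name
    interval = "1"              # hypothetical interval (one chromosome)

    for cmd in (genomicsdb_import_cmd(gvcfs, workspace, interval),
                genotype_gvcfs_cmd(reference, workspace, output_vcf)):
        # Quote each argument so the printed line can be pasted into a shell.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

The point of the sketch is the shape of the workflow: many `-V` inputs go into GenomicsDBImport, and GenotypeGVCFs then gets exactly one `-V`, which avoids the "cannot be specified more than once" error.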
Thanks for the help Geraldine! I gave the beta a try with GenomicsDBImport, and ended up getting an error unfortunately.
I also tried running CombineGVCFs in a more recent nightly build, but got a similar log4j2-related error. Same with the production version of 3.8.
Thanks again for all your help!
Ah, the log4j thing is probably a red herring -- that's a minor logging thing that shouldn't cause the run to actually fail, it just pollutes the log output (I think I saw some chatter about getting that fixed).
Possibly more relevant, the program is failing in a function that is supposed to read an index file, and one of the error types along the way is
java.io.EOFException
which translates to "end of file" -- this is generally seen when you have a file that is corrupted or incomplete. I would recommend checking the index files for all your GVCFs. Maybe try running on just one that you know is valid (e.g. because you can run a different tool on it, like ValidateVariants) and see if that works ok. If so, it's a matter of checking all your files for a rotten index, and regenerating the bad file(s). I could be wrong but that seems the most likely problem/solution based on these errors.
(which I realize I did not catch in your original message; sorry about that)
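With 40 GVCFs, a quick first pass over the index files can narrow down the suspects before running a validator on each one. The sketch below is an assumption-laden helper, not a GATK tool: it assumes each uncompressed GVCF has its index next to it as `<file>.idx`, and it only flags indexes that are missing, empty, or older than their GVCF. It cannot detect every truncated index, so a clean result does not prove the files are valid.

```python
import os
import sys
from typing import Iterable, List, Tuple


def suspect_indexes(gvcfs: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (gvcf, reason) pairs for GVCFs whose index looks missing, empty, or stale."""
    problems = []
    for gvcf in gvcfs:
        idx = gvcf + ".idx"  # assumption: index sits next to the uncompressed GVCF
        if not os.path.isfile(gvcf) or os.path.getsize(gvcf) == 0:
            problems.append((gvcf, "gvcf missing or empty"))
        elif not os.path.isfile(idx):
            problems.append((gvcf, "index missing"))
        elif os.path.getsize(idx) == 0:
            problems.append((gvcf, "index empty"))
        elif os.path.getmtime(idx) < os.path.getmtime(gvcf):
            # An index written before the GVCF was last modified may not match it.
            problems.append((gvcf, "index older than gvcf"))
    return problems


if __name__ == "__main__":
    # Usage: python check_indexes.py Sample_*/Sample_*.g.vcf sample_*/sample_*.g.vcf
    for path, reason in suspect_indexes(sys.argv[1:]):
        print(path + "\t" + reason)
```

Any file this flags, plus any file that still fails in a single-input run, is a candidate for regenerating the index or redoing the variant calling for that sample.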
Thanks Geraldine - I'll check those. It may also explain why CombineGVCFs worked when I tried it on a smaller subset. I may have just arbitrarily excluded the VCF with the culprit index. I'll let you know if that fixes it. Thanks again.
The VCF was the problem! Thanks so much for your help. I just re-did the variant calling on the one sample that was causing problems, and got my pipeline up and running again.
Best,
Scott
Wheee, excellent, we love to hear about problems being solved