GATK 3.6 always re-creates a tribble index for input VCF file instead of just reading it

mmokrejs (Czech Republic, Member)

Hi,
Like others, I am seeing some of my cluster jobs crash with:

ERROR --
ERROR stack trace

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:176)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:375)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:359)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:319)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:264)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:153)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.<init>(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.<init>(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1047)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:824)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:282)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:172)
... 14 more
Caused by: java.io.EOFException
at htsjdk.tribble.util.LittleEndianInputStream.readFully(LittleEndianInputStream.java:138)
at htsjdk.tribble.util.LittleEndianInputStream.readLong(LittleEndianInputStream.java:80)
at htsjdk.tribble.index.linear.LinearIndex$ChrIndex.read(LinearIndex.java:271)
at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:364)
at htsjdk.tribble.index.linear.LinearIndex.<init>(LinearIndex.java:101)
... 19 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: java.lang.reflect.InvocationTargetException
ERROR ------------------------------------------------------------------------------------------

I suspect that GATK re-creates the index file unconditionally instead of first checking whether it already exists. Could you please confirm or refute this theory by checking the code? I assume my other jobs keep running because they managed to re-create and open their index in time and already hold an open file handle. Perhaps the problem only arises when two processes try to re-create the index file at the same time.

RMDTrackBuilder - Writing Tribble index to disk for file grch38/cosmic/v79/VCF/CosmicCodingMuts__broad_compatible.resorted.vcf.idx

Otherwise, I do not understand why the timestamp of my CosmicCodingMuts__broad_compatible.resorted.vcf.idx is continually updated.
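
The `java.io.EOFException` raised from `LittleEndianInputStream.readLong` at the bottom of the stack trace is what one would expect if a reader opened the index while another process was still writing it. The following is only a minimal sketch of that failure mode, assuming a made-up layout of consecutive little-endian 64-bit values; it is not the real Tribble index format, and `write_index`, `read_long`, and `read_index` are hypothetical names.

```python
import io
import struct


def write_index(longs):
    """Serialize integers as consecutive little-endian 64-bit values."""
    return b"".join(struct.pack("<q", value) for value in longs)


def read_long(stream):
    """Read one little-endian 64-bit integer; raise EOFError on a short read."""
    chunk = stream.read(8)
    if len(chunk) < 8:
        raise EOFError(f"expected 8 bytes, got {len(chunk)}")
    return struct.unpack("<q", chunk)[0]


def read_index(data, count):
    """Read `count` values from a byte string standing in for an index file."""
    stream = io.BytesIO(data)
    return [read_long(stream) for _ in range(count)]


complete = write_index([10, 20, 30])
partial = complete[:20]  # simulates a reader arriving while the writer is mid-file

print(read_index(complete, 3))
try:
    read_index(partial, 3)
except EOFError as err:
    print("truncated index:", err)
```

In this toy layout the reader fails loudly because it knows how many values to expect. Whether a real index reader can always tell a partial file from a complete one depends on the format, which this sketch does not model.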

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @mmokrejs, the code specifically checks whether an index is already available, and only re-creates one if the index file is missing or cannot be read. If you have parallel processes that all try to access the input file and generate an index for it, that could cause errors. Try running a single process that takes that file as input and see if the index gets generated correctly.

  • mmokrejs (Czech Republic, Member)
    edited December 2016

    Thank you for your answer, but that does not help in my case. I do run a single process per sample, and that is not what is causing the problem here. Each of the threads uses the same input VCF file to learn the annotation from, so I cannot/won't even try that.

    Please introduce a file locking mechanism, like the following:

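    if [ ! -e "$index" ]; then                  # build only when no index exists yet
        if ( set -o noclobber; : > "$index.lock" ) 2>/dev/null; then   # exclusive create of the lock
            create_new_index "$index"           # placeholder for the actual indexing step
            rm -f "$index.lock"
        else
            exit 255                            # lock is held elsewhere or cannot be created
        fi
    fi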

    IMHO GATK should do something like this wherever possible.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    There is a file locking mechanism but it has caused us a lot of trouble over time -- we're actually getting rid of it in GATK4 and simply disallowing running on unindexed files. To compensate we'll have an indexing tool that can index any supported format appropriately, to be run as a prelude to analysis. This should resolve the common problems we've seen related to multiple jobs using the same resource or input file.
  • mmokrejs (Czech Republic, Member)
    edited December 2016

    Well, a clean exit is much better than the current situation. But then I advise disabling the broken index-calling functions right away in GATK-3.7, as you seem to confirm that the functionality is broken as of now.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    Well, it's not that simple. Lots of people's pipelines depend on the current functionality, which seems to work in the majority of cases. So it's difficult to justify breaking continuity to solve a minority problem. GATK4 will introduce several sources of disruption that are unavoidable so that's where we feel we can afford to make major changes to established behavior. Until then we have to live with the limitations of the current system.
  • mmokrejs (Czech Republic, Member)

    IMHO it is a fatal flaw that each job modifies a file that other tasks are currently using (they should not touch the already existing index of a COSMIC VCF file in the first place). I do not know what format the index has, but it appears the jobs fail only if the index file is structurally incomplete (not yet fully written). I cannot judge whether the jobs can detect that the index is incomplete (header complete while the rest is not), but failing to detect it would be the worst outcome. I would not like to imagine that only a part of the COSMIC VCF file was actually used.
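
The two concerns raised in this thread (only one process should build the index, and no reader should ever see a half-written one) can be illustrated together. The following is a hedged sketch, not GATK's implementation: `build_index_once` and its `build` callback are hypothetical names, the lock is an exclusively created `.lock` file as in the pseudocode above, and the index is written to a temporary file and then renamed into place so that the final path only ever holds a complete file.

```python
import os
import tempfile


def build_index_once(index_path, build):
    """Create index_path at most once, never exposing a partial file.

    `build` is a hypothetical callable returning the index content as bytes.
    Returns True if this call wrote the index, False if it did not.
    """
    if os.path.exists(index_path):
        return False  # a complete index is already published; do not touch it

    lock_path = index_path + ".lock"
    try:
        # O_CREAT | O_EXCL fails if the lock file exists, so one process wins.
        lock_fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else is building; the caller may wait and retry

    try:
        if os.path.exists(index_path):
            return False  # published while we were acquiring the lock
        directory = os.path.dirname(os.path.abspath(index_path))
        tmp_fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(tmp_fd, "wb") as tmp:
                tmp.write(build())
                tmp.flush()
                os.fsync(tmp.fileno())
            # Rename within one directory: readers see either no index or all of it.
            os.replace(tmp_path, index_path)
        except BaseException:
            os.unlink(tmp_path)  # drop the partial temporary file
            raise
        return True
    finally:
        os.close(lock_fd)
        os.unlink(lock_path)


with tempfile.TemporaryDirectory() as workdir:
    idx = os.path.join(workdir, "example.vcf.idx")
    first = build_index_once(idx, lambda: b"index-bytes")
    second = build_index_once(idx, lambda: b"other-bytes")
    with open(idx, "rb") as handle:
        content = handle.read()
    leftovers = sorted(os.listdir(workdir))

print(first, second, content, leftovers)
```

Two caveats apply. A process that is killed while holding the lock leaves a stale `.lock` file behind, and the atomicity of exclusive creation and rename should be verified on the shared filesystem a cluster actually uses. Building the index once before submitting the parallel jobs avoids both issues.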
