my god

mforde84 (chicago) Member

ERROR MESSAGE: Timeout of 30000 milliseconds was reached while trying to acquire a lock on file /glusterfs/data/conte_datasets/NDAR/Wigler_data_full/genomics_sample03/D056HACXX-4-BG1.A_1_fq.gz.bam.sort.bam.real.bam.HC.SNP.INDEL.vcf.idx. Since the GATK uses non-blocking lock acquisition calls that are not supposed to wait, this implies a problem with the file locking support in your operating system.

I mean wut?

Answers

  • Sheila (Broad Institute) Member, Broadie admin

    @mforde84
    Hello,

    The way you phrase your questions is making me laugh at my desk.
    Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/1252/unifiedgenotyper-gets-hung-waiting-for-file-lock

    -Sheila

  • mforde84 (chicago) Member
    edited July 2015

    Yes, and it's just problem after problem with GATK. I mean, I'm working on an enterprise-level cloud here. I don't see the answer "In some rare cases the protocols used by the GATK and by the operating system are not compatible" as a reasonable explanation, especially when all I'm trying to do is merge VCF files. When I try "--disable_auto_index_creation_and_locking_when_reading_rods" I get memory errors ... Fantastic, so I'm assuming now I have to dump more memory into Java, ok .. how much? ... no one knows ... while it just sits there eating up less than a 10th of my compute cycles?! Then, most likely, I get additional Java errors that read like complete gibberish.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi @mforde84,

    I understand your frustration, but you could help us to help you by telling us what you're actually trying to do -- what tool are you running, on what sort of dataset, and so on, as well as the technical specifications of the platform you're trying to run on. The terms "enterprise level cloud" are not exactly enlightening in this regard.

    Keep in mind also that GATK was designed to run on traditional filesystems, so it's not so surprising to run into problems with some of the newer cloud-based infrastructures. The differences between these infrastructures are substantial, which is why we make no claims regarding the possibility of running GATK on cloud platforms. The somewhat good news is that we are currently in the middle of a big push to rewrite substantial parts of the GATK engine to make it run natively on cloud platforms. As you may have heard in a recent announcement, we are collaborating with Google Genomics to make this happen. But of course this will take some time. In the meantime, my team will do their best to help you.

  • mforde84 (chicago) Member

    So no more Java? I can dream, can't I?

  • Sheila (Broad Institute) Member, Broadie admin

    @mforde84
    Ha. No, there is definitely more Java :smile:

    -Sheila

  • mforde84 (chicago) Member
    edited July 2015

    Please, no. Why are you continuing to develop this in Java? I mean really, there's no good reason for it. Everyone can compile C. This is the reason why I never use GATK, unless someone specifically asks me to use it. I'm not in the minority here either. GATK is horrible with memory and CPU management, primarily because it is written on top of a heavy and obfuscated coding language. Procedural code, not object oriented.

  • pdexheimer Member ✭✭✭✭

    You know, I kind of had similar thoughts when I first came across GATK years ago. Maybe not quite so self-righteous, but I definitely remember thinking things like "Why did they cripple themselves by using Java? It's big, slow, verbose, inefficient... For such big data, why wouldn't they use C++?"

    But the thing is, if you actually use it, it's pretty amazingly quick and efficient. Java is not the painful, inefficient language it was 15 years ago. The vast majority of GATK analyses can run in 4GB of RAM or less, despite the fact that they're operating on files that are frequently in the dozens or hundreds of gigabytes. The tradeoff (because there always is one), is that the file I/O demands are very high. I saw a 30-50% reduction in runtimes when my institution upgraded our storage infrastructure - so yes, if you have slow I/O, your cpu usage will be suboptimal because it's always sitting idle waiting for the data.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    For the record, we tried rewriting the GATK in C++ last year as an internal project. It turned out that even once the developers had gotten over the hurdle of getting used to C++, method development was still much slower than in Java. That was a dealbreaker for us. But the rewritten engine code is still around as a fully open-source project if you want to go and re-implement the whole thing yourself. https://github.com/broadinstitute/gamgee

  • evolvedmicrobe (MGH) Member

    @mforde84 "Everyone can compile C" is certainly not true in my experience, nor in that of many others. C has a lot of undefined behavior, and the compatibility problems can become all-consuming in C++ (I know because I personally spend a lot of time dealing with incompatibilities between clang/gcc, C++11 versus C++98, make versus cmake versus autoconf). Java has for a long time been considered a premier language for "write once, run anywhere", and its ability to scale horizontally is considered a big advantage by many people. Many large projects also find through profiling that their code is slow not because of missing optimizations like avoiding object headers, but because they failed to implement algorithmic and processing improvements due to the difficulty of refactoring. Java is really easy to run and debug. If you at present think C is easier, I might suggest it's worth taking another look to gain more familiarity with Java.

  • mforde84 (chicago) Member
    edited July 2015

    Java will never scale as well as native code execution, primarily because it's virtualizing JRE, it doesn't like in-memory caches (I have a terabyte of memory to work with... stop writing everything to disk), and many Java programs have poorly optimized GC. Certainly, it can be optimized, but we're not talking about Hadoop here, we're talking about GATK, where it takes an hour to write a tribble index compared to 2 minutes using tabix. That's a pressing issue for any end user, whether on a desktop or a cloud. Also, this is just a general complaint, the way Java handles multithreading... my god. Ok, I get it, people still use Java, but let's be honest, it's a legacy language. Not saying that's a bad thing. But most things Java does, another framework was specifically designed to do better.

  • KlausNZ Member ✭✭

    @mforde84, in case it helps, we see the message in your original post relatively frequently. We have tried hard to resolve it, but found it unrelated to GATK parameters, file size, compression state, or any other influences we can exert on the GATK. The same GATK command operating on the same files may result in that message at a given time, but not a few minutes or hours later. Or vice versa.
    In contrast, we found this error message to be highly correlated to overall I/O load on the cluster file system (large university cluster). In fact, it's so sensitive to overall fs condition that we've considered running GATK constantly to 'monitor' fs health (or keep tabs on what the climate modellers are up to ;-).

    @Geraldine, so sad to learn my dream of GATK in C is over...

  • mforde84 (chicago) Member

    yea, glusterfs might be the issue. but typically only a few of us are using this server at any given time. we migrated most people to a new cinderblock based setup, but, now that i think of it, i remember top on the headnode showed a user performing a large scp transfer, so maybe that's the issue. it probably makes sense to compare performance on the new system. thanks for the idea. now see, that's useful.
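
The lock timeout in the original post can be checked outside of GATK. Below is a minimal sketch in Python, not GATK's own Java code: it makes one non-blocking lock attempt on a file from a worker thread and reports whether the call returns within a wall-clock limit. The function name `probe_lock`, the status strings, and the use of `fcntl.lockf` as a stand-in for GATK's locking calls are all assumptions made for illustration; only the 30-second limit comes from the error message. It needs a POSIX system, and it creates the file if it does not already exist.

```python
import errno
import fcntl
import os
import threading
import time


def probe_lock(path, timeout_s=30.0):
    """Make one non-blocking exclusive lock attempt on `path`.

    Returns (status, seconds). status is "acquired", "held_by_other",
    "error", or "timeout" (the call did not return within timeout_s).
    Hypothetical helper for illustration; not part of GATK.
    """
    result = {}

    def attempt():
        try:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        except OSError:
            result["status"] = "error"
            return
        try:
            # LOCK_NB asks for an immediate answer instead of waiting.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            result["status"] = "acquired"
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                result["status"] = "held_by_other"
            else:
                result["status"] = "error"
        finally:
            # Closing the descriptor also releases any lock we obtained.
            os.close(fd)

    # A daemon thread lets the caller give up even if the call stalls
    # inside the filesystem and never returns.
    worker = threading.Thread(target=attempt, daemon=True)
    start = time.monotonic()
    worker.start()
    worker.join(timeout_s)
    elapsed = time.monotonic() - start
    if worker.is_alive():
        return "timeout", elapsed
    return result.get("status", "error"), elapsed
```

Running it against a scratch file on the GlusterFS mount and again on local disk, at busy and quiet times, would show whether the lock call itself stalls under I/O load, as KlausNZ describes. A "timeout" result means a call that should have returned immediately did not, which is the condition the error message reports.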
