Hello, I'm new to GATK and Queue. I understand that we can write a QScript in Queue to generate separate GATK jobs and run them on a cluster of several nodes. Can we implement GATK or Queue on Google Hadoop?
It seems that implementing GATK on Hadoop requires tons of work.
Yes and no. The GATK wasn't implemented with Hadoop in mind, but that is only for historical reasons.
One could envision a full reimplementation of the engine to handle HDFS and make -nt / -nct work transparently in a Hadoop framework. This is not "a lot of work", but it is work that requires deep knowledge of the internals of the GATK. Right now we don't have the resources to implement this ourselves, or to provide the level of support that would be necessary for someone else to do it.
On the other hand, one could implement a wrapper around the GATK, much like Queue, to instantiate it in a Hadoop cluster. This is not a lot of work at all; in fact, there are people outside our group already thinking about this problem. Our own resources are very limited, but this alternative should require much less understanding of the GATK engine and is probably feasible for a good software engineer to tackle.
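To make the wrapper idea a bit more concrete, here is a minimal sketch of what the per-task side of such a wrapper might look like. It is not an existing tool and not something our group provides. It assumes the work is split by genomic interval and that each task builds one ordinary GATK command line restricted to its interval with -L. The function names, file names (GenomeAnalysisTK.jar, ref.fasta, sample.bam), the choice of UnifiedGenotyper, and the output naming scheme are all hypothetical placeholders.

```python
def build_gatk_command(interval,
                       jar="GenomeAnalysisTK.jar",   # hypothetical path to the GATK jar
                       tool="UnifiedGenotyper",      # example walker; any -T tool could go here
                       reference="ref.fasta",        # hypothetical reference file
                       bam="sample.bam"):            # hypothetical input BAM
    """Return the argument list for one GATK run restricted to `interval`."""
    # One output file per interval, so independent tasks never write to the same file.
    out = "calls.%s.vcf" % interval.replace(":", "_")
    return ["java", "-jar", jar,
            "-T", tool,
            "-R", reference,
            "-I", bam,
            "-L", interval,
            "-o", out]


def map_intervals(lines):
    """Yield (interval, command) pairs, one per non-empty input line.

    `lines` is any iterable of text lines, each holding one interval
    such as "chr1:1-1000" or "chr2".
    """
    for line in lines:
        interval = line.strip()
        if not interval:
            continue  # skip blank lines
        yield interval, build_gatk_command(interval)


# Small demonstration with a fixed list of intervals.
for interval, command in map_intervals(["chr1:1-1000", "chr2"]):
    print(interval + "\t" + " ".join(command))
```

In a Hadoop Streaming setup, a mapper along these lines would read its intervals from standard input and then actually launch each command (for example with subprocess) instead of printing it. Getting the reference and BAM files onto each node and merging the per-interval VCFs afterwards are the parts this sketch leaves out.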
Hello, some of us at Duke University, along with the person who posted the original question, are thinking of writing a wrapper around the GATK so that it can be used on a Hadoop cluster. Before we start, we would like to get some feedback on the utility and feasibility of creating such a wrapper. Could you please share any thoughts on this, such as the potential performance advantage, the challenges in writing the software, and the amount of background work we would have to do to understand the GATK code base? Thanks for your input!
To be honest this is not something we have given a lot of thought to, and right now we can't spare the resources to look at it with the seriousness needed to fully answer your questions. One important caveat is that our developer-oriented documentation is rather sparse at the moment, so that may be the biggest stumbling block; we aim to deal with that issue progressively over the next few months, but in the meantime we will not be able to offer you much support toward grokking the GATK codebase.
That being said, I hope this does not deter you from undertaking this project, as there seems to be some demand for this and there should not be any unreasonable technical difficulty involved. Good luck!
Thank you for your response. Can you point us to the code base and to the developer-oriented documentation as it exists today? I searched the website for the documentation, and the closest I could find was http://www.broadinstitute.org/gatk/guide/topic?name=developer-zone. Is that all of the developer documentation, or is there a more consolidated document? Thanks.
You can get the source code of the full GATK on https://github.com/broadgsa/gatk-protected (which has a restrictive license) or the framework only on https://github.com/broadgsa/gatk (which is MIT-licensed).
I'm afraid the "Developer Zone" is indeed all we have for dev docs right now, aside from the code javadocs of course.
May I ask what the progress is? Thanks a lot!
We are now looking at technologies other than Hadoop.
What are the new technologies you are looking at? I am curious about the progress. Currently, we have a project that wants to use Hadoop and GATK.
Clouds and Spark. For more info, see these two links: