# Can we implement GATK/Queue on Google Hadoop?

Member Posts: 7

Hello, I'm new to GATK and Queue. I understand that we can write a QScript in Queue to generate separate GATK jobs and run them on a cluster of several nodes. Can we implement GATK or Queue on Google Hadoop?

## Answers

• Member Posts: 7

It seems that implementing GATK on Hadoop requires tons of work.

• Charlestown, MA · Member · Posts: 274 · admin
edited June 2013

Yes and no. The GATK wasn't implemented with Hadoop in mind, but that is only for historical reasons.

One could envision a full reimplementation of the engine to handle HDFS and make -nt / -nct work transparently in a Hadoop framework. This is not "a lot of work", but it is work that requires deep knowledge of the internals of the GATK. Right now we don't have the resources to implement this ourselves, or to provide the level of support that would be necessary for someone else to do it.

On the other hand, one could implement a wrapper around the GATK, like Queue does, to instantiate it in a Hadoop cluster. This is not a lot of work at all; in fact, there are people already thinking about this problem outside our group. Unfortunately our resources are very limited, but this alternative should require much less understanding of the GATK engine and is probably feasible for a good software engineer to tackle.
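To make the wrapper idea a bit more concrete, here is a minimal sketch of what the "map" side of such a wrapper might look like: each input line names one genomic interval, and the wrapper builds a GATK 3 command line restricted to that interval with `-L`. This is only an illustration under assumptions, not something we provide or support. The file names (`ref.fasta`, `sample.bam`, `GenomeAnalysisTK.jar`), the per-interval output naming, and the function names are all hypothetical, and the sketch only builds the commands; it does not launch them or talk to Hadoop.

```python
import shlex


def build_gatk_command(interval, reference="ref.fasta", bam="sample.bam",
                       jar="GenomeAnalysisTK.jar", tool="HaplotypeCaller"):
    """Return the argv list for one GATK 3 run restricted to one interval.

    All default file names are placeholders (assumptions for illustration).
    """
    # Hypothetical naming scheme: one output VCF per interval.
    out_vcf = "calls.{}.vcf".format(interval.replace(":", "_"))
    return [
        "java", "-jar", jar,
        "-T", tool,        # which GATK tool (walker) to run
        "-R", reference,   # reference FASTA
        "-I", bam,         # input BAM
        "-L", interval,    # restrict processing to this interval
        "-o", out_vcf,     # per-interval output
    ]


def map_intervals(lines):
    """Mapper-style step: turn each non-empty line (one interval) into a
    tab-separated 'interval<TAB>command' record."""
    for line in lines:
        interval = line.strip()
        if not interval:
            continue  # skip blank lines
        argv = build_gatk_command(interval)
        yield "{}\t{}".format(interval, " ".join(shlex.quote(a) for a in argv))


# Example: two intervals in, two command records out.
for record in map_intervals(["chr1\n", "chr2:1-1000\n"]):
    print(record)
```

A real wrapper would still have to run each command on a worker node (for example with `subprocess.run`), make the reference and BAM visible to that node, and merge the per-interval outputs afterwards. That gather step is the part Queue normally handles for you.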

• Durham, NC · Member · Posts: 12

Hello, some of us at Duke University, along with the person who posted the original question, are thinking of writing a wrapper around the GATK so that it can be used on a Hadoop cluster. Before we start, we wanted to get some feedback on the utility and feasibility of creating such a wrapper. Could you please provide any feedback or thoughts on this, such as the potential performance advantage, the challenges in writing the software, and the amount of background work we would have to do to understand the GATK code base? Thanks for your input!

• Cambridge, MA · Member, Administrator, Broadie · Posts: 11,413 · admin

Hi @pagarwal14,

To be honest this is not something we have given a lot of thought to, and right now we can't spare the resources to look at it with the seriousness needed to fully answer your questions. One important caveat is that our developer-oriented documentation is rather sparse at the moment, so that may be the biggest stumbling block; we aim to deal with that issue progressively over the next few months, but in the meantime we will not be able to offer you much support toward grokking the GATK codebase.

That being said, I hope this does not deter you from undertaking this project, as there seems to be some demand for this and there should not be any unreasonable technical difficulty involved. Good luck!

Geraldine Van der Auwera, PhD

• Durham, NC · Member · Posts: 12

Thank you for your response. Can you point us to the code base and to the developer-oriented documentation as it exists today? I searched the website for the documentation, and the closest I could find was at http://www.broadinstitute.org/gatk/guide/topic?name=developer-zone. Is that all of the developer documentation, or is there a more consolidated document? Thanks.

• Cambridge, MA · Member, Administrator, Broadie · Posts: 11,413 · admin

Hi there,

You can get the source code of the full GATK on https://github.com/broadgsa/gatk-protected (which has a restrictive license) or the framework only on https://github.com/broadgsa/gatk (which is MIT-licensed).

I'm afraid the "Developer Zone" is indeed all we have for dev docs right now, aside from the code javadocs of course.

Geraldine Van der Auwera, PhD

• China · Member · Posts: 2

May I ask what the progress is? Thanks a lot!

• Cambridge, MA · Member, Administrator, Broadie · Posts: 11,413 · admin

We are now looking at technologies other than Hadoop.

Geraldine Van der Auwera, PhD

• Member Posts: 1

What are the new technologies you are looking at? I am curious about the progress. We currently have a project that wants to use Hadoop and GATK.

• Cambridge, MA · Member, Administrator, Broadie · Posts: 11,413 · admin

Clouds and Spark. For more info, see these two links:

Geraldine Van der Auwera, PhD
