GATK4 is completely open source

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
edited May 2017 in Announcements

This is one of two posts announcing the imminent beta release of GATK4; for a technical description of features, see this other post.

"Wait, what?" Yes, you read that right, we're moving GATK4 to a fully open source license -- specifically, BSD 3-clause. And to be clear, this applies to all of GATK4. Not just the core framework (which, little known fact, has always been open source), but all the tools that were previously "protected", including HaplotypeCaller, the new CNV discovery tools, everything. The whole enchilada.


Old-timers in the field (i.e. anyone with what, 3+ years of experience?) will recognize this as a major shift. An important subset of the GATK -- some might say "all the really valuable bits" -- has been under a mixed licensing model since version 2.0 was released in 2012. Under this mixed model, GATK was free for academic/non-profit research purposes, while any for-profit use required a paid commercial license. The proceeds funded further GATK development and support.

Admittedly, the move from the initial open-source state of GATK 1.x to the mixed licensing model caused a fair amount of debate. I'm not going to revisit it in full (even my therapist is sick of hearing about it), but it's fair to say that the licensing created an obstacle to our interactions with some other groups, and that it raised some barriers to access to GATK, especially for smaller companies and startups.

Since then the context within which we operate at the Broad has evolved significantly: a little over two years ago, our small development team was assimilated into a then-newly created larger group called the Data Sciences Platform (DSP), which aims to tackle the big challenges in genomics with robust engineering solutions. This involves applying some novel approaches compared to traditional academic software development, including: 1) give engineers a good home; 2) focus on products, not projects; and 3) maximize openness. This last point in particular means that our DSP mothership-within-Broad recognizes the immense potentiating role of open-source software in driving technological and methodological innovation. In fact, all of DSP's software products have been open-source since its inception, with the notable exception of GATK, which it inherited in a mixed state.

Over the past two years, the collaborations that DSP has cultivated with external groups have immensely benefited the development of the new framework that would eventually become GATK4. Key features that we have come to rely on were contributed as open-source code by external collaborators: the GenomicsDB datastore that allows us to scale joint genotyping to tens of thousands of whole genomes, by Karthik Gururaj and colleagues at Intel; the Genomics Kernel Library, which provides many impressive speedups for the GATK, by George Powley at Intel; the NIO functionality that allows us to access data on Google Cloud Storage directly, by JP Martin at Google; and the Apache Spark support that allows us to parallelize operations in a much more robust way than before, by Tom White at Cloudera. And it's not all about institutional collaborations; we have also received spontaneous contributions from individuals such as Daniel Gómez-Sánchez of the Institut für Populationsgenetik of Vienna, which have collectively enhanced the GATK codebase and its value to the user community.

So with GATK4 on the cusp of release, and with enthusiasm from all of us at the Broad, we're seizing this opportunity to do a reboot* and bring into alignment our mandate (to build great software), our mission (to empower great research) and our means: a more community-minded approach anchored in openness and free exchange of ideas.

* (at least we had already ditched Jar-Jar "Phone Home" Binks...)

I expect the benefits of this new direction are fairly self-evident, so I'll do us all a favor and close with just one last, somewhat personal note specifically from the development team. We want to thank all the collaborators who have worked with us so far for their support, their invaluable contributions and their faith in what we could accomplish together. And as we turn over this new leaf, we look forward to welcoming into the GATK family anyone who would like to see how much further we can push the genomics envelope.
