Using the GATK API to annotate my VCF

I just quickly wrote a set of tools to annotate my VCFs (http://plindenbaum.blogspot.fr/2013/02/4-tools-i-wrote-today-to-annotate-vcf.html).
For example, one of those tools uses a BED/XML file indexed with tabix to annotate my VCF. (My code just uses the Java API for tabix to get the XML at a given position.)
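(For readers who haven't used the tabix Java binding: that kind of lookup might look roughly like the sketch below. It is written against the standalone TabixReader.java bundled with tabix; the file name and region are made up for the example, and details vary between versions.)

    import java.io.IOException;

    public class TabixLookup {
        public static void main(String[] args) throws IOException {
            // Open a bgzipped, tabix-indexed file (hypothetical name).
            TabixReader reader = new TabixReader("annotations.bed.gz");
            // Fetch all records overlapping chr1:10000-20000.
            TabixReader.Iterator it = reader.query("chr1:10000-20000");
            String line;
            while (it != null && (line = it.next()) != null) {
                // Each line is one raw record from the indexed file,
                // e.g. a BED interval whose extra column carries the XML payload.
                System.out.println(line);
            }
        }
    }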
Question: is there something in the GATK API that would allow me to implement my code on top of it? What kind of walker should I use? What would be the benefits of using the GATK API? For example, will using a GATK walker automatically make my code parallelizable?
Pierre
Best Answers
Geraldine_VdAuwera Cambridge, MA admin
Hi Pierre,
For VCF annotation, the easiest would be to write an annotation module for VariantAnnotator. I would recommend choosing one of the existing modules in org.broadinstitute.sting.gatk.walkers.annotator that looks most like what you want to achieve as a starting point. If you want to annotate based on an external resource file (as opposed to internal calculations), then have a look at the SnpEff annotator module as an example.
Using VariantAnnotator will indeed make your code parallelizable with the -nt argument. If you're writing your own walkers, you can make them parallelizable just by implementing the appropriate interface. There are a lot of other benefits to using the GATK framework, depending on what you're interested in doing...
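To make that concrete, here is a minimal sketch of what a custom annotation module looked like in the GATK 2.x (sting) framework this answer refers to. The class name, the MYANN key, and the annotated value are invented for the example, and the package paths and method signatures follow that era's source tree, so they may differ in other releases:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    // Package paths below follow the GATK 2.x source tree and may vary by release.
    import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
    import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
    import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
    import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.AnnotatorCompatible;
    import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation;
    import org.broadinstitute.variant.variantcontext.VariantContext;
    import org.broadinstitute.variant.vcf.VCFHeaderLineType;
    import org.broadinstitute.variant.vcf.VCFInfoHeaderLine;

    // Hypothetical module that writes the alternate allele count
    // into an INFO field called MYANN.
    public class MyAnnotation extends InfoFieldAnnotation {

        public List<String> getKeyNames() {
            return Arrays.asList("MYANN");
        }

        public List<VCFInfoHeaderLine> getDescriptions() {
            return Arrays.asList(new VCFInfoHeaderLine("MYANN", 1,
                    VCFHeaderLineType.Integer, "Number of alternate alleles"));
        }

        public Map<String, Object> annotate(RefMetaDataTracker tracker,
                                            AnnotatorCompatible walker,
                                            ReferenceContext ref,
                                            Map<String, AlignmentContext> stratifiedContexts,
                                            VariantContext vc) {
            Map<String, Object> result = new HashMap<String, Object>();
            result.put("MYANN", vc.getAlternateAlleles().size());
            return result;
        }
    }

Once the module is on the classpath (the engine discovers annotation modules by scanning its packages, so placement matters), it would be invoked with something like:

    java -jar GenomeAnalysisTK.jar \
        -T VariantAnnotator \
        -R reference.fasta \
        -V input.vcf \
        -A MyAnnotation \
        -nt 4 \
        -o annotated.vcf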
Mark_DePristo Broad Institute admin
Pierre, we'd be quite interested in your experiences with the GATK, and what benefits you find programming in it. Obviously we've all drunk the Kool-Aid already, so it's nice to get an external perspective. In my mind, there are a few key benefits:
-- The engine is highly validated. Many complex IO operations (loading 1000 BAM files and 50 input VCF files) are correct, efficient, and low-memory. Performing efficient random access on 1000 BAM files in 2G of RAM is hard to do correctly, and using the GATK engine simply gives you all of this for free.
-- The Map/Reduce architecture forces you to program in a simple way without too much control over your IO. A surprisingly large amount of code ends up reading, writing, and managing input and output data. With the GATK this is largely abstracted away, and honestly out of your control, so you cannot do anything that may be bad for you in the long run.
-- Because you don't have control of your IO or your data access patterns, the GATK can pretty much automagically parallelize your computation for you. Simply implementing the TreeReducible interface means you support -nt parallelism. Implement NanoSchedulable and you can also use -nct with your walker. If you implement both, you can do both at the same time (see the sketch after this list). GATK Queue can scatter/gather your code automatically if you have a standard output type, or you just need to implement a simple gather function in Queue.
-- Finally, there is a lot of library code in the GATK itself doing quite complex things (Smith-Waterman alignment, the multi-sample exact calculation from genotype likelihoods, realigning reads to indels, left-aligning indels) that makes you more productive.
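To illustrate the parallelism point above, here is a minimal sketch of a walker in the GATK 2.x (sting) framework. The walker itself (a trivial locus counter) is invented for the example, and the class names and signatures follow that era's source tree, so they may differ in other releases:

    // Package paths below follow the GATK 2.x source tree and may vary by release.
    import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
    import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
    import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
    import org.broadinstitute.sting.gatk.walkers.LocusWalker;
    import org.broadinstitute.sting.gatk.walkers.TreeReducible;

    // Hypothetical walker that just counts covered loci.
    public class CountLociWalker extends LocusWalker<Integer, Integer>
            implements TreeReducible<Integer> {

        // map() is called once per locus; emit 1 for each.
        public Integer map(RefMetaDataTracker tracker, ReferenceContext ref,
                           AlignmentContext context) {
            return 1;
        }

        // reduceInit() provides the starting value for the running total.
        public Integer reduceInit() {
            return 0;
        }

        // reduce() folds each map() result into the running total.
        public Integer reduce(Integer value, Integer sum) {
            return value + sum;
        }

        // treeReduce() merges partial totals computed by separate threads;
        // providing it is what lets the engine run this walker with -nt.
        public Integer treeReduce(Integer lhs, Integer rhs) {
            return lhs + rhs;
        }
    }

NanoSchedulable, in this framework, is a marker interface with no methods, so adding "implements NanoSchedulable" to a walker whose map() is thread-safe is all it takes to enable -nct as well.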
In my view, people often take shortcuts by building little programs quickly in languages like Perl or Python against some of the very nice libraries like BioPerl and BioPython (I've done this many times myself). You often end up with a very good little program that does the one thing you want it to, with specific scaling needs (often run once, on one small dataset), assuming a specific input structure. As soon as you want to generalize that program, scale it to larger datasets, or allow multiple (or even a dynamic number of) input files in many formats, you basically collapse under the weight of all of the IO systems you need to build. The GATK is basically our solution to this problem. It's much harder to write programs in the GATK at the start -- the barrier to entry is high -- but you gain long-term productivity by not having to deal with the hard problems of IO and scaling efficiency.
Answers
Thank you Mark & Geraldine. I think I'm going to play with the API next week. I've already got a bunch of questions to ask, but I'll try out a few things before asking any stupid questions.