Writing custom tools: htsjdk vs picard vs GATK

Hi fellow htsjdk/picard/gatk developers!

I've been thinking about this for quite some time now, so I thought I should write up a quick post about it here.

I've been writing custom tools for our group using both picard and GATK for some time now. It's been working nicely, but I have been missing a set of basic tutorials and examples, for users to quickly get started writing walkers. My most commonly used reference has been the 20-line life savers (http://www.slideshare.net/danbolser/20line-lifesavers-coding-simple-solutions-in-the-gatk) which is getting a bit dated.

What I would like to see is something like the following:

  • What's in htsjdk? What's not in htsjdk? (from a dev's perspective - in terms of frameworks)
  • What's in picard? What's not in picard? (from a dev's perspective - in terms of frameworks)
  • What's in gatk? What's not in gatk? (from a dev's perspective - in terms of frameworks)
  • When to use htsjdk, picard and GATK. What are the strengths and weaknesses of the three? (possibly more that I've missed)
  • Your first htsjdk walker
  • Your first picard walker
  • Your first gatk walker
  • Traversing a BAM in htsjdk vs gatk - what are the differences

There might be more stuff that could go in here as well. The driving force behind this is that I'm myself a bit confused by the overlap of these three packages/frameworks. I do understand that picard uses htsjdk, and that GATK uses both as dependencies, but it's not super clear what extra functionality (for a developer) is added from htsjdk -> picard -> gatk.

Could we assemble a small group of interested developers to contribute to this? We could set up a git repo with the examples and tutorials for easy collaboration and sharing online.

Anyone interested? I'll count myself as the first member :)

Answers

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Hi Dan,

    We've been thinking about this a lot on our end as well. At the risk of committing this, we've been involved in developing a unified framework that includes the functionalities of all three, without the legacy issues, and fully open-source. Mauricio @Carneiro has been leading the charge on this and will be giving a talk about it at CPPCON on September 9th. If you'd like to read more about it in the meantime, the project, which is called Gamgee (because SAM...), is accessible here.

  • dklevebring Member
    edited August 2014

    Hmm...

    To be honest, I don't have a problem with the overlap of htsjdk and gatk. Both serve good purposes, with relevant differences. I would just like those differences to be clearly stated, along with some minor examples.

    Gamgee seems to use htslib, so the same could be said there.

    What is the driving force behind gamgee? Speed? A more focused code base?

    Personally I like the Java code base. My motivation behind this post was not to argue that we need a new standard (yeah, that xkcd is a good one) but rather that I'd like to see a lower threshold for developers. I understand you're super busy, so whatever I can do to help...

  • Let's extend the list then.

    For a developer in this fast-moving field, a brief summary of how the following frameworks/libraries/tools relate to each other would be useful.

    • htsjdk
    • picard
    • gatk
    • htslib
    • samtools
    • gamgee

    Below is a short text about each of these frameworks/tools as I understand them. Feel free to comment and alter this so that it's closer to reality.

    htsjdk

    Repo: https://github.com/samtools/htsjdk

    Java-framework used by picard and GATK to access reads in bam files (no pileups) and variants in VCF files

    picard

    Repo: https://github.com/broadinstitute/picard

    Java tool set using htsjdk to perform tasks on fastq, sam, bam and vcf files. Loop-based code, so no (easy) multiprocessor support.
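
    To make "loop-based" concrete, here is a minimal sketch of that style in plain Java. It does not use picard or htsjdk at all: the class name LoopBasedFlagCounter and the hand-rolled parsing of plain-text SAM lines are my own assumptions for illustration. The only facts taken from the SAM format are that header lines start with "@", that the second tab-separated column is FLAG, and that bit 0x4 marks an unmapped read. A real tool would let htsjdk do the parsing.

```java
import java.util.List;

// Hypothetical stand-in for a loop-based tool; this is not picard's actual API.
class LoopBasedFlagCounter {

    // SAM FLAG bit 0x4 marks a read as unmapped.
    static final int UNMAPPED_FLAG = 0x4;

    // Counts mapped reads in plain-text SAM lines with one sequential loop.
    static long countMappedReads(Iterable<String> samLines) {
        long mapped = 0; // single mutable accumulator owned by the loop
        for (String line : samLines) {
            if (line.isEmpty() || line.startsWith("@")) {
                continue; // skip blank lines and header lines
            }
            String[] fields = line.split("\t");
            int flag = Integer.parseInt(fields[1]); // column 2 is FLAG
            if ((flag & UNMAPPED_FLAG) == 0) {
                mapped++;
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        List<String> sam = List.of(
            "@HD\tVN:1.6\tSO:coordinate",
            "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
            "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
            "read3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII");
        System.out.println(countMappedReads(sam)); // prints 2
    }
}
```

    Because the accumulator lives inside a single loop, splitting the work across processors means restructuring the tool by hand, which is the limitation noted above.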

    GATK-framework

    Repo: https://github.com/broadgsa/gatk

    Java-framework further extending htsjdk with capabilities such as accessing pileups. Defines a set of Walkers for developers to extend to traverse bam or vcf files. These include RefWalker, RodWalker, LocusWalker and the lesser-used ReadPairWalker and DuplicateWalker. (Did I forget any?) The framework uses map-reduce functionality for multiprocessor support. A good start for writing walkers is http://www.slideshare.net/danbolser/20line-lifesavers-coding-simple-solutions-in-the-gatk, which gives a good overview of the most common walkers even if a few details are outdated.
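
    As a rough illustration of why a map-reduce contract makes multiprocessor support easier, here is a sketch in plain Java. Everything in it is hypothetical: SketchWalker, HighMapqCounter and SketchEngine are names I made up, the "reads" are just mapping-quality integers, and the merge step is my own simplification. Real GATK walkers receive much richer context, so treat this as the general pattern rather than the GATK API.

```java
import java.util.List;

// Hypothetical interface, loosely modeled on the walker idea described above;
// it is not the GATK API.
interface SketchWalker<T, M, R> {
    M map(T item);            // per-item computation, independent of other items
    R reduceInit();           // starting value for the accumulator
    R reduce(M value, R sum); // fold one mapped value into the accumulator
    R merge(R left, R right); // combine partial results from two shards
}

// Example walker: counts items whose mapping quality is at least a threshold.
class HighMapqCounter implements SketchWalker<Integer, Integer, Long> {
    private final int minMapq;
    HighMapqCounter(int minMapq) { this.minMapq = minMapq; }
    public Integer map(Integer mapq) { return mapq >= minMapq ? 1 : 0; }
    public Long reduceInit() { return 0L; }
    public Long reduce(Integer value, Long sum) { return sum + value; }
    public Long merge(Long left, Long right) { return left + right; }
}

class SketchEngine {
    // Runs the walker over each shard in parallel, then merges the partial results.
    static <T, M, R> R traverse(SketchWalker<T, M, R> walker, List<List<T>> shards) {
        return shards.parallelStream()
            .map(shard -> {
                R sum = walker.reduceInit();
                for (T item : shard) {
                    sum = walker.reduce(walker.map(item), sum);
                }
                return sum;
            })
            .reduce(walker.reduceInit(), walker::merge);
    }

    public static void main(String[] args) {
        List<List<Integer>> shards = List.of(List.of(60, 0, 30), List.of(10, 60));
        System.out.println(traverse(new HighMapqCounter(20), shards)); // prints 3
    }
}
```

    The walker author only writes map, reduceInit, reduce and merge; how the input is sharded and scheduled is left to the engine. The sketch assumes merge is associative and that reduceInit returns an identity value for it, which is what allows the partial results to be combined in any order.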

    GATK-tools

    Repo: https://github.com/broadgsa/gatk-protected

    A set of tools using the GATK-framework for analysis of bam and vcf files (mostly).

    htslib

    Repo: https://github.com/broadinstitute/htslib

    C-framework for manipulation of BAM/CRAM and VCF files.

    samtools

    Repo: https://github.com/samtools/samtools.git

    Tool set built using htslib for manipulation of sam files (mostly).

    gamgee

    Repo: https://github.com/broadinstitute/gamgee

    C++ framework further extending htslib. (Is gamgee to htslib what GATK-framework is to htsjdk, in broad terms?)
