Version highlights for GATK version 3.8

Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin
edited July 2017 in Announcements

One more 3.x version, for the road! That's right, even as we're ramping up our efforts on GATK4 (we're three beta releases in at this point, and getting down to brass tacks writing the migration guide ahead of the 4.0 general release) we still found it worthwhile to cut one last release of GATK3.

Our main motivation here is to introduce the Intel Genomics Kernel Library, which comes bearing the gift of speed improvements for those of you who won't be able to migrate to GATK4 right away.

As a secondary benefit, this version includes a handful of bug fixes, some usability improvements including better error messages, documentation fixes and logging tweaks, and a few improvements to annotation calculations (especially in allele-specific mode), which you'll find described briefly in the release notes. No big changes though, except perhaps the new default behavior of VariantsToTable with regard to missing annotation values, discussed below. Finally, we've committed a copy of all the peripheral documentation (= the docs that live in the forum and complement the tool documentation) to the now-old GATK codebase.

And thus, the last-ever GATK3 version emerges covered in carbonite.

Introducing the Intel Genomics Kernel Library

The Genomics Kernel Library or GKL is an open-source library developed by our collaborators at Intel that provides accelerated versions of algorithms, i.e. "kernels", used in genomics tools. These kernels are optimized to run on Intel Architecture under 64-bit Linux and Mac OSX. They're plugged into the GATK in such a way that they will be automatically used if your computing hardware supports them, but if it doesn't they will remain inactive and the "default" generic Java versions will be used instead.

At the moment there are three main kernels included:

  • Intel inflater/deflater: a file compression/decompression kernel that provides different levels of compression (with correspondingly variable speedups). This replaces the JDK inflater/deflater and is now activated by default. It can be disabled by using the -jdk_deflater and -jdk_inflater flags.

  • Intel chip optimization for PairHMM: a version of the PairHMM algorithm used by HaplotypeCaller to calculate genotype likelihoods that runs faster on Intel hardware. It can be disabled by setting -pairHMM LOGLESS_CACHING, for example if you need completely deterministic behavior across different machine types (at the expense, of course, of speed).

  • FPGA support for PairHMM: another version of the PairHMM algorithm, this one designed to run on FPGAs, which are a type of processor that is gaining popularity for computing applications that require extremely high speed. The FPGA support in this version is fairly experimental so we can't guarantee results, but if you have access to this specialized hardware we definitely encourage you to try it out and let us know how it goes.
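
As a rough illustration of the fallback flags named in the list above, the following dry-run sketch assembles a HaplotypeCaller command that switches off both the Intel inflater/deflater and the Intel PairHMM. The jar path, reference, BAM and output names are placeholders chosen for this example, not values from the release. The script only prints the command so it can be inspected before anything is run.

```shell
# Placeholder paths (assumptions for this sketch); adjust to your own setup.
GATK_JAR="GenomeAnalysisTK.jar"
REF="reference.fasta"
BAM="sample.bam"

# Flags taken from the kernel descriptions above:
#   -pairHMM LOGLESS_CACHING      use the generic Java PairHMM
#   -jdk_deflater / -jdk_inflater use the JDK compression code
GKL_OFF_CMD="java -jar $GATK_JAR -T HaplotypeCaller -R $REF -I $BAM -o sample.vcf -pairHMM LOGLESS_CACHING -jdk_deflater -jdk_inflater"

# Dry run: print the command instead of executing it.
echo "$GKL_OFF_CMD"
```

Leaving these three flags off gives the default behavior described above, where the accelerated kernels are used automatically if the hardware supports them.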

Attitude adjustment for VariantsToTable

VariantsToTable is a tool we're quite fond of because it allows us to extract just the information we want from VCFs when we want to probe a callset interactively, typically for filtering purposes. Previously we had to tell it explicitly not to freak out if it came across any sites or genotypes where an annotation we requested was missing; but realistically, there are always some sites for which we can't calculate some annotations (like ranksum annotations at sites where we don't have any heterozygous samples), so that was annoying. Now we've flipped the behavior so that by default the tool keeps going and just outputs "NA" anywhere it encounters such sites or genotypes, unless you specify that it should freak out by using the --errorIfMissingData flag.
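
To make the two behaviors concrete, here is a dry-run sketch of a lenient and a strict VariantsToTable command. The file names and the particular fields requested (QD, MQRankSum, GT) are assumptions picked for illustration; only the --errorIfMissingData flag comes from the description above. The script prints both commands rather than running them.

```shell
# Placeholder paths (assumptions for this sketch); adjust to your own setup.
GATK_JAR="GenomeAnalysisTK.jar"
REF="reference.fasta"
VCF="callset.vcf"

# Default in 3.8: sites or genotypes lacking a requested annotation get "NA".
# -F selects site-level fields, -GF selects per-genotype fields.
VTT_CMD="java -jar $GATK_JAR -T VariantsToTable -R $REF -V $VCF -F CHROM -F POS -F QD -F MQRankSum -GF GT -o callset.table"

# Strict variant: stop with an error when a requested value is missing.
VTT_STRICT_CMD="$VTT_CMD --errorIfMissingData"

# Dry run: print both commands instead of executing them.
echo "$VTT_CMD"
echo "$VTT_STRICT_CMD"
```

The lenient form suits interactive exploration, while the strict form may be preferable in a pipeline where a missing annotation should be treated as a failure.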

Documentation archive and deprecation plans

In preparation for the general release of GATK4 (in the form of a 4.0 version), we made a copy of all the peripheral (forum-based) documentation in its current state and archived it in the codebase itself here. This is intended to be a permanent archive for documentation that we are phasing out in favor of GATK4-focused documentation.

Our ultimate goal is to provide some degree of continuity and support for users who cannot migrate to GATK4 right away and must continue to use older versions, without leaving too much clutter around that might confuse everyone else.

In the immediate future we will delete three sets of documents from the forum (and therefore from the website):

  • "Developer Zone": replaced in GATK4 by a developer-oriented Wiki in the github repository;
  • "Queue": superseded for all versions by Cromwell+WDL;
  • The current contents of "Archive", which have typically been replaced by individual articles linked at the top of the deprecated article.

Within the other documentation sections, articles may get updated in place or moved to the Archive for future removal. Versioned tool documentation going back to 3.5-0 will remain available on the website for the foreseeable future. For older versions, the documentation can be built from source. Finally, the Best Practices section of the website will be updated to reflect the new world order once GATK 4.0 is released and becomes the officially supported version of GATK. Going forward we'll have versioned Best Practices accompanied by a publicly available WDL script for each major use case. We'll post more details of what this will look like in the coming weeks.


  • EADG (Kiel), Member ✭✭✭

    Is there a list of which Intel CPUs will support GKL?
    I always need a reason for my boss to buy new hardware ;)

    Greetings EADG

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hah, no kidding :D

    I'm not aware of any such list but if you're interested I can put you in touch with the folks at Intel who can best tell you that.

  • SkyWarrior (Turkey), Member ✭✭✭

    Any AVX-compatible Intel CPU (Sandy Bridge, Sandy Bridge-EP, Core i3/i5/i7, Xeon E3/E5 and above) should give decent acceleration, I think. I have seen a nice boost since 3.8 even though I don't use multithreading in most of my workflow. I only multithread BWA and BQSR, because I need the bamout in HC, and I noticed (with my humble testing, of course; YMMV) that concurrent sample workflows are faster than multithreading a single sample with all you have. [4 WES samples are completed with annotation and all QC extras in 5 hours on average.]

  • EADG (Kiel), Member ✭✭✭

    Hi @SkyWarrior, @Geraldine_VdAuwera

    4 WES samples in 5 hours sounds fast. Can you give me a rough description of the system you are using (CPU/memory)?

    That would be nice. It would also be interesting to know which Intel CPUs support FPGA right now.

    Another question is whether MuTect2 also profits from the faster PairHMM calculation.

    Greetings EADG

  • SkyWarrior (Turkey), Member ✭✭✭
    edited August 2017

    Hi @EADG

    I am using a Skylake-X i9 7900X with 128 GB of RAM. My genome and reference VCF files are on an M.2 NVMe SSD and my scratch disk is an 8 TB, 256 MB cache, 7200 RPM spinner. Ubuntu 17.04 and all the regular stuff is loaded.

    I am running a maximum of 4 threads per workflow and I run 4 workflows concurrently. This setup finishes 4 samples per 5 hours for 50-60X WES samples and 4 samples per 6-7 hours for 100X WES samples. I could shorten this by about an hour and a half, but that time is usually used to collect per-sample data for more advanced analyses like CNV etc.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Yes, MuTect2 does benefit from the acceleration of PairHMM.

  • Faten (Malaysia), Member

    Hi @Geraldine_VdAuwera,

    I ran HaplotypeCaller in GATK version 3.8. May I ask whether stand_emit_conf is no longer available?

    I tried running this command: java -jar GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R hg38.fa -I sorted_RG_dedup_mark2.bam -stand_emit_conf 10 -stand_call_conf 30 -o variants.vcf

    The error shown is: ##### ERROR MESSAGE: Invalid command line: The parameter standard_min_confidence_threshold_for_emitting is deprecated. This argument is no longer used in GATK versions 3.7 and newer. Please see the online documentation for the latest usage recommendations.

    Thank you.

  • Sheila (Broad Institute), Member, Broadie ✭✭✭✭✭

    The stand_emit_conf is indeed no longer an option in 3.8. You can only use stand_call_conf. Have a look at this post for more information.
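
For reference, a sketch of the command from the question with the deprecated -stand_emit_conf argument simply dropped and everything else left as posted. Whether 30 is still the right calling threshold for a given analysis is not addressed here. The script prints the command as a dry run rather than executing it.

```shell
# Same paths and file names as in the question above; only
# "-stand_emit_conf 10" has been removed.
HC_FIXED_CMD="java -jar GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R hg38.fa -I sorted_RG_dedup_mark2.bam -stand_call_conf 30 -o variants.vcf"

# Dry run: print the command instead of executing it.
echo "$HC_FIXED_CMD"
```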


  • MattB (Newcastle), Member ✭✭

    Hi @Geraldine_VdAuwera and @Sheila, regarding the doc_archive in GitHub: do you think you could also archive the tool documentation, e.g. pages like this? I'm just thinking that some of those arguments and defaults will perhaps change with the move over to 4.x, and it would be good to have the old 3.x ones archived in their final state as of 3.8 along with the other docs.

  • MattB (Newcastle), Member ✭✭

    Ah, I've just seen the dropdown here, which I hadn't noticed before because I always google the name of the tool. It might be useful to push that orange dropdown to the actual tool documentation pages, so that people landing directly on those pages are aware that the documentation is versioned.

  • shlee (Cambridge), Member, Broadie ✭✭✭✭✭

    Hi @MattB,

    Thanks for voicing your need for access to old 3.x documentation. Rest assured, we also think it is prudent to keep this documentation around, so we are planning to differentiate 3.x and 4.x documentation via different subforums, much like we do for WDL and FireCloud.

    We will also keep the orange dropdown to track changes to minor versions for provenance.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    To be clear, the command-line tool docs will continue to be presented as they are today, though we can certainly improve the visibility of the versioning information.

    Regarding the "peripheral" documentation, to elaborate a bit on @shlee's comment, we aim to provide clear distinction, during a forthcoming transition period, between documents that we update for use with GATK 4 (and/or remain equally applicable across versions) vs. documents that only apply to GATK 3 and older versions, which will eventually be archived and deprecated. Some details remain to be determined, but our goal here is to minimize confusion and friction, as much as humanly possible.

  • datakid (At my desk), Member

    What is the expected EoL for 3.8?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    There will be no further 3.8-x releases (no more code changes, bug fixes etc) and we aim to discontinue support for any new work by Dec 31 2019 -- so starting Jan 1 2020 we expect all new work to be done with a 4.x version. However we'll still answer questions about results that were previously obtained with a 3.x version (we're not monsters).
