Version highlights for GATK version 3.6

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited June 2016 in Announcements

What better way to start the summer than with a new GATK release?

Umm no don't answer that, there's loads of good options. You could have a barbecue, eat some ice cream, go on a hike if that's the sort of thing that floats your kayak... Or you might live somewhere where winter is just starting and everything I just said there was terribly insensitive. Sorry.

Ahem. As I mentioned in my recent sneak preview blog post, the bulk of our development effort (speed! copy number! unicorns!) is now going into the GATK4 project. Accordingly, development in the GATK3 framework is winding down, so this release consists mainly of bug fixes, added convenience functions, and relatively minor behavior tweaks.

That being said we do have a few new experimental features in the VQSR tools (which haven't yet been fully ported to GATK4, hence the ongoing development in GATK3) that are pushing the envelope of allele-specific filtering. So that's interesting, if not yet fully documented (someone should really get on that). And you'll probably care about some of those tweaks I casually mentioned above -- in fact I guarantee that at least one of these things will matter to you in some way. If you read through the whole thing and don't find anything relevant to you, tell me in the comments that I was wrong. That's what the internet is for.

As usual, here I go over the changes that matter the most / to the most; consider going through the release notes as well for a full list of changes.


One version of Java to rule them all

Possibly the most sweeping change in this version is that it introduces support for Java 8. As noted recently, when we switched our test framework to Java 8 we encountered multiple failures in GATK 3.5 tests, which I discussed here. We fixed the underlying issues, so from version 3.6 onward GATK now runs reliably on Java 8. As a nice bonus, this puts us back into sync with HTSJDK and the Picard toolkit, which had been running on Java 8 for a few months already. If you were doing it right, you had both versions of Java installed and ran each toolkit, GATK and Picard, on the appropriate version. How much hassle? Too much hassle! — Now you can run everything on Java 8.


ET finally gets home; discovers phone bill, flees to Canada

Here's another change that will affect everyone regardless of use case, in a good way: we removed the Phone Home usage reporting system. It served its purpose for many years, but we've outgrown it. So we ripped it all out. If you previously used a key to deactivate it with the -et NO_ET and —key <your_key> arguments, you can stop and take those out. Or if you're just too busy, leave them in -- the Phone Home code is all gone but we left in the argument definitions, so the parser shouldn't fail if you leave them in your commands. This shouldn't break any scripts.


Indel realignment tools drop out, go open-source

In the next few days we'll be making some updates to our Best Practices documentation to reflect updates to our production pipelines, and one thing you'll notice is that local realignment around indels is no longer included in the pre-processing part of the pipelines. More on that in a follow-up post; in a nutshell, indel realignment is just not useful enough anymore when you're calling variants with haplotype-based tools like HaplotypeCaller and MuTect2. This does mean we will no longer support the indel realignment tools as actively; but since others may care more about preserving and possibly even expanding this functionality, we've decided to move the relevant tools, IndelRealigner and RealignerTargetCreator, and related classes to the part of the GATK that is open-sourced under the MIT license.


Confidence matters

In response to popular demand, we made two frequently requested enhancements to the output of the variant discovery tools.

First, HaplotypeCaller and GenotypeGVCFs will now emit a no-call (./.) for any sample where GQ is zero, in both normal and GVCF modes, instead of emitting a specific genotype in which we have zero confidence. Note that these do not necessarily indicate that the variant call itself is bad, but that we are unable to choose between two (or more) genotypes for that particular sample.

Second, GenotypeGVCFs will now emit a QUAL value for invariant sites (i.e. where all samples are genotyped homozygous-variant) when run in -allSites mode. By the way, we're aware that many of you would like to see a GVCF-like output option for the final VCF, in which you'd get essentially the same informational content as an all-sites VCF but with some form of interval-block compression. We still do not provide that functionality, and we do not have any immediate plans to work on implementing it in GATK3; however we are talking about how to make something like that happen in GATK4. Why the feet-dragging? Well, while it's not hard to do something like that badly, doing it correctly is quite a bit harder, so it will take some non-trivial development effort —- and since it's not something we currently need at Broad for our production needs, it's difficult to prioritize. But hey, the more you tell us you want this, the more we can justify putting it in the roadmap, so leave a comment below if you care about this.


Bug in the spotlight

Today's bug of infamy concerns the handling of allele depths (AD) in the joint variant discovery workflow, when the AD of the NON_REF allele is non-zero. This was an interesting little edge case. First we have HaplotypeCaller emitting a GVCF where, at a given position, the AD for the NON_REF allele is non-zero. This is fine, even expected. The problem arises during GVCF merging or joint genotyping, when CombineGVCFs or GenotypeGVCFs takes this NON_REF AD count and -- this is the bug -- copies it over to one of the alleles seen in another sample. This results in observing sum(AD) > DP, which is obviously not good. The new behavior, post-fix, is that we still record the NON_REF AD in the single-sample GVCF, but we don't copy it over to concrete alleles in CombineGVCFs or GenotypeGVCFs. Because, duh.


Dude, where's my read?

Some days it seems like the number one preoccupation of GATK forum denizens is tracking down what happened to every last read in a pileup. That's probably because HaplotypeCaller and MuTect2, the two stars of our little show, can be awfully picky about what reads they include at key stages of the calling process. Their (totally justified) tendency to discount reads that they don't consider useful often causes confusion and/or concerns that reads are somehow being dropped for no reason. That's a concern we'd like to help alleviate. HaplotypeCaller and MuTect2 both already have an option to output a bam file showing the post-assembly read alignments and candidate haplotypes, which we call a bamout and by default shows the state of the data used by HC and M2 to make their calls. Now we've added an option to include in that output any reads that were not considered informative or were dropped for any reason internal to the caller (separate from any read filtering done by the engine). The option is called --emitDroppedReads and should be used in conjunction with -bamout.

Post edited by Geraldine_VdAuwera on

Comments

  • EADGEADG KielMember

    Yeah JAVA 8 finallly XD The --emitDroppedReads command is also nice.

  • bwainechobwainecho Bay areaMember

    We WANT GVCF-like output option for the final VCF!!!

    How does Broad deal with no-calls? Hard to believe it's not in the production flow.

    Issue · Github
    by Sheila

    Issue Number
    985
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    @bwainecho We hear you :)

    In the Broad's pipeline, and specifically the *research* pipeline (as opposed to clinical), it's not a problem because all calling is done on cohorts. Because of how the joint analysis is set up, as long as you have even just one sample with a variant of interest, you will get information for that site in all samples. So for each sample you will have a genotype call, and if you have a no-call genotype you'll know why from the associated annotations. If you find you have sites where all samples are either reference or no-called due to lack of data, those sites will be unusable anyway (in the experimental designs I am familiar with anyway) -- you wouldn't be able to say anything about them -- so we don't care about having records for them (again, as far as I know).

    In the clinical context, things are a bit different because in many cases you may not be allowed to call samples with other patients' samples. In that case you have several options to work around the problem, depending on what you're looking for. If you're looking to genotype at specific markers you can run with `-allSites` with a list of sites of interest, which should give you all you need. If you don't have a specific list of sites and using global `-allSites` is not workable for you, then you would have to get the information some other way. One way is to do a coverage analysis to identify callable vs uncallable regions or targets (if working with exomes). I'm not sure exactly what is done in the Broad's clinical pipelines; will try to find out.
  • Hi @Geraldine_VdAuwera

    We currently have java 7 and GATK 3.5 on our Linux server.

    Our IT staff may not be able to upgrade our server from java 7 to java 8 anytime soon. Does this mean it is not recommended for me to download the latest GATK 3.6 and run it with java 7 ? Can I download GATK 3.6 and run it with java 7 ?

    Are these 2 tools "RealignerTargetCreator" and "IndelRealigner" still part of the GATK 3.6 tool kit ? (or) you just not recommend that it is not necessary to perform those 2 steps going forward ?

    Thanks,
    mglclinical

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @mglclinical

    Please use the recommended Java versions with GATK versions. If you download and use GATK v3.6, you should use it with Java 8. It is possible certain tools are backwards compatible. Definitely some tool are NOT backwards compatible. Given testing for this backwards compatibility is not something we do, you should always follow the recommendations. That is, only use GATK v3.6 with Java 8.

    RealignerTargetCreator and IndelRealigner are a part of the GATK v3.6 tool kit. You can get the list of tools for each version with java -jar GenomeAnalysisTK.jar -h.

    The recommendation that removes the indel realignment step from the Best Practice Workflows is for certain high quality data and workflows that use HaplotypeCaller. Please see this blogpost for more details.

  • Hi @shlee

    thank you for the quick response.

    I have not yet downloaded 3.6

    I will stick to 3.5 as java 7 is the recommended java version for 3.5 , and we cannot upgrade to java 8 quickly

    Thanks,
    mglclinical

  • xin.huangxin.huang Washington, DCMember

    @Geraldine_VdAuwera Looks like this version has many nice features and fixes, including the --emitDroppedReads tag. Thank you for your persistent work at making GATK more reliable and accommodating! Having the GVCF-like output for VCF would be fantastic.

    If I care about coverage at a certain locus, and hope to use the coverage information after HC locally re-assembles the reads, how many I achieve that? Now that when the GQ is zero for a locus, that locus is just a no-called, how may I know if this "no-called" is due to low confidence for the SNP calling, or the actual lack of coverage? I'm primarily concerned about RNA-Seq data, so expression would be interesting to me.

    Thanks,

    Xin

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    You can use the AD values for this purpose.
  • xin.huangxin.huang Washington, DCMember

    @Geraldine_VdAuwera thank you for your response! I've found there are several types of output in the FORMAT field. I'm wondering if GQ=0, the genotype is ./., and there is read coverage, the AD/DP value would be non-zero?

    BTW, I've noticed that for some places, where GQ=0, but a genotype is still called (reference-homozygote 0/0), for example in the following line:
    JXUM01S000035 275528 . G C 79.94 . AC=2;AF=0.200;AN=10;DP=12;ExcessHet=0.4576;FS=0.000;MLEAC=1;MLEAF=0.100;MQ=60.00;QD=26.51;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 0/0:3,0:3:9:.:.:0,9,81 0/0:3,0:3:0:.:.:0,0,24 ./.:0,0:0:.:.:.:0,0,0 1/1:0,2:2:6:1|1:275528_G_C:90,6,0 0/0:1,0:1:3:.:.:0,3,27 0/0:3,0:3:0:.:.:0,0,31 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @xin.huang
    Hi,

    Yes, I think this may be a rounding issue. Can you submit a bug report so I can reproduce that site? Instructions are here.

    Thanks,
    Sheila

  • fwang6fwang6 Houston,TX,USAMember
    edited August 2016

    Is mutect2 in this package the stable release version or it is still in beta?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    @fwang6 Still in beta. On the bright side, the MuTect2 work has been prioritized for this quarter so we should have an improved version ready in a couple of months. Sorry for the long wait.
  • fwang6fwang6 Houston,TX,USAMember

    Thank you for your information!

    @Geraldine_VdAuwera said:
    @fwang6 Still in beta. On the bright side, the MuTect2 work has been prioritized for this quarter so we should have an improved version ready in a couple of months. Sorry for the long wait.

Sign In or Register to comment.