
Release notes for GATK version 3.2

ebanks (Broad Institute; Member, Broadie, Dev)
edited October 2014 in Announcements

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.


We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.


Haplotype Caller

  • Various improvements were made to the assembly engine and likelihood calculation, leading to more accurate genotype likelihoods (and hence better genotypes).
  • Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
  • The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
  • Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
  • Fixed small bug where indel discovery was inconsistent based on the active-region size.
  • Removed scary warning messages for "VectorPairHMM".
  • Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
  • When we subset PLs because alleles are removed during genotyping we now also subset the AD.
  • Fixed bug where reference sample depth was dropped in the DP annotation.
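
To make the PL/AD subsetting item above concrete, here is a minimal Python sketch for the diploid case. It relies on the standard VCF genotype ordering, in which genotype j/k with j <= k sits at index k*(k+1)/2 + j. The function names are hypothetical, and the step that renormalizes the PLs so the smallest value is 0 is an assumption, not a confirmed description of what GATK does internally.

```python
from typing import List, Sequence, Tuple


def pl_index(j: int, k: int) -> int:
    """Index of the unordered diploid genotype j/k in a VCF-ordered PL array."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def subset_pl_and_ad(
    pl: Sequence[int], ad: Sequence[int], keep: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Keep only the alleles in `keep` (original allele indices, 0 = REF).

    Hypothetical helper for illustration only.
    """
    new_pl: List[int] = []
    # Walk the genotypes in VCF order over the new allele indices.
    for k_new in range(len(keep)):
        for j_new in range(k_new + 1):
            new_pl.append(pl[pl_index(keep[j_new], keep[k_new])])
    # Assumption: rescale so the most likely remaining genotype has PL 0.
    best = min(new_pl)
    new_pl = [value - best for value in new_pl]
    # AD has one entry per allele, so subsetting is a direct lookup.
    new_ad = [ad[i] for i in keep]
    return new_pl, new_ad


# Three alleles (REF, ALT1, ALT2); drop ALT1 and keep REF and ALT2.
pl = [50, 0, 40, 60, 30, 90]  # 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
ad = [10, 8, 1]
print(subset_pl_and_ad(pl, ad, keep=[0, 2]))  # ([0, 10, 40], [10, 1])
```

Subsetting AD together with the PLs keeps the two fields describing the same set of alleles, which is the consistency the release note refers to.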

Variant Recalibrator

  • The -mode argument is now required.
  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

AnalyzeCovariates

  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

Variant Annotator

  • SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

  • Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
  • Fixed recurring error about missing PLs.
  • No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
  • --includeNonVariantSites should now be working.

Select Variants

  • The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

  • Now throws a user error when it encounters reads with I operators greater than the number of read bases.
  • Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.
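
The two CIGAR conditions mentioned above can be screened for before running the tool. The following Python sketch is not the Indel Realigner's own code. The function names are hypothetical, and reading "I operators greater than the number of read bases" as "a single I element longer than the read" is an assumption.

```python
import re
from typing import List, Tuple

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operator) pairs."""
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings with characters the pattern did not consume.
    if "".join(n + op for n, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in tokens]


def insertion_exceeds_read(cigar: str, read_length: int) -> bool:
    """True if any single I element is longer than the read itself."""
    return any(op == "I" and n > read_length for n, op in parse_cigar(cigar))


def is_all_insertion(cigar: str) -> bool:
    """True if every element is an insertion, e.g. '50I'."""
    elements = parse_cigar(cigar)
    return bool(elements) and all(op == "I" for _, op in elements)


print(insertion_exceeds_read("10M60I10M", read_length=50))  # True
print(is_all_insertion("50I"))  # True
print(is_all_insertion("25M25I"))  # False
```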

CalculateGenotypePosteriors

  • Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
  • Now computes genotype posteriors using likelihoods from all members of the trio.
  • Added annotations for calling potential de novo mutations.
  • Now uses PP tag instead of GP tag because posteriors are Phred-scaled.
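
As a rough illustration of what "Phred-scaled posteriors" means, the sketch below combines Phred-scaled likelihoods (PL) with per-genotype priors using Bayes' rule and converts the result back to the Phred scale (-10 * log10(p)). The function name and the prior values are hypothetical, and rescaling so that the best genotype gets 0 is an assumption modeled on how PLs are conventionally reported. The actual priors used by CalculateGenotypePosteriors are not reproduced here.

```python
import math
from typing import List, Sequence


def phred_scaled_posteriors(pl: Sequence[float], priors: Sequence[float]) -> List[int]:
    """Combine PLs with genotype priors and return Phred-scaled posteriors."""
    if len(pl) != len(priors):
        raise ValueError("pl and priors must have the same length")
    if any(p <= 0 for p in priors):
        raise ValueError("priors must be strictly positive")
    # Convert Phred-scaled likelihoods back to linear space.
    likelihoods = [10 ** (-value / 10.0) for value in pl]
    # Bayes' rule: posterior is proportional to likelihood times prior.
    weighted = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(weighted)
    posteriors = [w / total for w in weighted]
    # Back to the Phred scale, shifted so the best genotype is 0 (assumption).
    phred = [-10.0 * math.log10(p) for p in posteriors]
    best = min(phred)
    return [round(value - best) for value in phred]


# Flat priors leave the ranking and spacing of the PLs unchanged.
print(phred_scaled_posteriors([0, 20, 50], [1 / 3, 1 / 3, 1 / 3]))  # [0, 20, 50]
# Hypothetical informative priors shift the values.
print(phred_scaled_posteriors([0, 10, 40], [0.9, 0.09, 0.01]))  # [0, 20, 60]
```

The flat-prior case corresponds to the sites for which the release note says flat priors are applied: the posteriors then carry the same information as the input likelihoods.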

Cat Variants

  • Can now process .list files with -V.
  • Can now handle BCF and Block-Compressed VCF files.

Validate Variants

  • Now works with gVCF files.
  • By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

FastaAlternateReferenceMaker

  • Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.
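
For readers unfamiliar with IUPAC encoding, the sketch below shows the standard two-base ambiguity codes that a heterozygous SNP genotype maps to. It covers diploid SNP genotypes only, and the function name is hypothetical. How FastaAlternateReferenceMaker handles indels, no-calls, or other cases is not covered and should not be inferred from it.

```python
# Standard IUPAC nucleotide codes for one base or a pair of bases.
IUPAC_CODES = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def iupac_for_genotype(allele1: str, allele2: str) -> str:
    """Return the IUPAC code for a diploid SNP genotype (hypothetical helper)."""
    return IUPAC_CODES[frozenset((allele1.upper(), allele2.upper()))]


print(iupac_for_genotype("A", "G"))  # R
print(iupac_for_genotype("T", "C"))  # Y
print(iupac_for_genotype("G", "G"))  # G
```

Because the code depends on one sample's genotype, a multi-sample VCF needs the sample to be named explicitly, which is what the argument above is for.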

Miscellaneous

  • Refactored maven directories and java packages replacing "sting" with "gatk".
  • Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
  • Added a new read transformer that refactors NDN cigar elements to one N element.
  • Now a Tabix index is created for block-compressed output formats.
  • Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
  • Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
  • We now check that output files are specified in a writeable location.
  • We now allow blank lines in a (non-BAM) list file.
  • Added legibility improvements to the Progress Meter.
  • Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
  • Made IntervalSharder respect the IntervalMergingRule specified on the command line.
  • Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.
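
The NDN refactoring mentioned in the list above can be pictured with a small Python sketch. It is not the GATK read transformer itself. The function name is hypothetical, and summing the three element lengths into one N is an assumption, chosen because N and D both consume reference bases and no read bases, so the alignment span is preserved.

```python
import re
from typing import List, Tuple

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def merge_ndn(cigar: str) -> str:
    """Collapse every N-D-N run in a CIGAR string into a single N element."""
    elements: List[Tuple[int, str]] = [
        (int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)
    ]
    merged: List[Tuple[int, str]] = []
    i = 0
    while i < len(elements):
        ops = [op for _, op in elements[i:i + 3]]
        if ops == ["N", "D", "N"]:
            total = sum(n for n, _ in elements[i:i + 3])
            # Replace in place and re-check, so chains like N-D-N-D-N also collapse.
            elements[i:i + 3] = [(total, "N")]
            continue
        merged.append(elements[i])
        i += 1
    return "".join(f"{n}{op}" for n, op in merged)


print(merge_ndn("10M5N2D7N20M"))  # 10M14N20M
print(merge_ndn("10M5N20M"))  # 10M5N20M
```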

Comments

  • miked (Member)

    Can I process a gVCF generated by HC v3.1 downstream with CombineGVCFs and GenotypeGVCFs v3.2 ?

    Does this cause backwards incompatibility:
    "Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods."

    I'm interested in using v3.2 CombineGVCFs and CatVariants because a bug has been fixed allowing it to support gzipped VCFs as input and output as previously reported here: http://gatkforums.broadinstitute.org/discussion/3904/incremental-joint-variant-discovery-and-number-of-samples

    Thanks for the help.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    The two versions of HaplotypeCaller are technically compatible, so running 3.1 output gVCFs through 3.2 should work, but it comes with a big caveat: if you do this for a dataset generated with 3.1, then add new samples called using 3.2 to your cohort, you may end up with batch effects. While the difference between 3.0 and 3.1 was minimal, there is substantially more difference between 3.1 and 3.2. Results coming out of 3.2 will be better and have qualitatively different information (e.g. the post-reassembly AD and DP values as you mention), which is undesirable for project consistency. So we do recommend sticking with one version for a given project. But if you have your entire cohort and just want to run it as a "one and done" analysis, that should be okay. Just don't mix and match GVCFs from different versions.
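
    One way to guard against accidentally mixing versions is a small bookkeeping check before joint genotyping. The Python sketch below assumes you already know which HaplotypeCaller version produced each sample's GVCF (for example from your own pipeline records). The function names and sample names are hypothetical, and nothing here is part of GATK.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_version(sample_versions: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sample names by the caller version that produced their GVCF."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample, version in sorted(sample_versions.items()):
        groups[version].append(sample)
    return dict(groups)


def require_single_version(sample_versions: Dict[str, str]) -> str:
    """Return the shared version, or raise if the cohort mixes versions."""
    groups = group_by_version(sample_versions)
    if len(groups) != 1:
        raise ValueError(f"cohort mixes caller versions: {groups}")
    return next(iter(groups))


# Hypothetical cohort records.
cohort = {"sampleA": "3.1", "sampleB": "3.1", "sampleC": "3.2"}
print(group_by_version(cohort))  # {'3.1': ['sampleA', 'sampleB'], '3.2': ['sampleC']}
```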

  • erikt (Member)

    Hi! First comment, so I want to thank all of you at Broad for all your work on this incredible tool, for sharing it with the greater community, and for supporting it here! My question is also about compatibility, but going back a step. I just finished setting up and running the 3.1 pipeline on some WES data. As the v3.2 HC is said to have significant improvements I would like to rerun with this version, but I wonder if it is necessary/advantageous to rerun the pipe starting from the realignment step, or can I start from my final merged bams?

    Thanks,
    Erik

  • KinMok (London; Member)

    Yes, I have a similar question to Erik's. Do I need to re-run indel realignment and base quality recalibration for the samples that were processed with v3.1? Or can I simply rerun with version 3.2, starting from HaplotypeCaller emitting gVCFs?
    Thanks
    Kin

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @erikt & @KinMok

    Glad you like the tools, Erik :)

    No, it is not necessary to redo the data processing (realignment & BQSR) on data that was previously processed using versions 2.8 or later. You can just rerun from the HaplotypeCaller step.

  • Hi @Geraldine_VdAuwera, should I use the corresponding bundle dataset when I update GATK? Also, where can I get the bundle dataset for v3.2? I can't find a 3.2 bundle on the FTP site: ftp.broadinstitute.org/gsapubftp-anonymous/bundle

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @chenyu600,

    We don't issue a new bundle for every version. Since nothing has changed in the resources files since 2.8, you can use that version for working with GATK 3.2.

  • My project includes 700 WES samples that are divided into different sequencing batches. Since those batches arrived at different times over the past year and a half, they were processed using different versions of BWA and GATK. Now I want to re-run the whole cohort of 700 samples from HC using GATK v3.2, and I have two questions:
    1. What would be the memory and time estimate for running CombineGVCFs on all 700 samples in one step? Would it be better to run it in two steps, e.g. 350+350?
    2. Will BAM files produced by different versions of BWA and GATK cause batch effects?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @blueskypy,

    I can't give you an estimate, sorry. Considering you've done quite a bit of testing on CombineGVCFs you probably know more than me about that at this point! I would expect that it's more efficient to produce several smaller combined subsets than one huge one.
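
    Splitting a cohort into smaller subsets for combining can be sketched in a few lines of Python. The batch size of 100 and the file names are purely hypothetical illustrations, not a recommendation from the GATK team. Each resulting batch would then be combined separately.

```python
from typing import List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split a list of GVCF paths into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


# Hypothetical file names for a 700-sample cohort.
gvcfs = [f"sample_{n:03d}.g.vcf" for n in range(1, 701)]
batches = make_batches(gvcfs, batch_size=100)
print(len(batches))  # 7
print(len(batches[-1]))  # 100
```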

    For batch effects, see my earlier response to a similar question. We do recommend using the same version for everything, but bam files processed with 2.8 don't need to be redone.

  • knho (Indiana, USA; Member)

    Hi Geraldine,

    According to the GATK documentation, Queue is recommended over multithreading for parallelizing HaplotypeCaller in version 3.2 because of reported issues. Did those issues cause any problems in the output? If I use multithreading (-nct) to parallelize HaplotypeCaller, can the output be wrong?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @knho,

    No. When multithreading-related issues occur, the program may fail to complete the run, but we are not aware of any incorrect results being output when a run completes successfully.

  • knho (Indiana, USA; Member)

    Thanks, Geraldine. In my experience with a ~800-sample WGS dataset, several HaplotypeCaller jobs failed with the error message "error='Cannot allocate memory'". Is that one of the multithreading-related issues?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @knho That sounds like a generic java memory error. It can happen in relation to multithreading.
