The GATK 2.0 release includes both the addition of brand-new (and often still experimental) tools and updates to the existing stable tools.
Base Recalibrator (BQSR v2), an upgrade to CountCovariates/TableRecalibration that generates base substitution, insertion, and deletion error models.
Reduce Reads, a BAM compression algorithm that reduces file sizes by 20x-100x while preserving all information necessary for accurate SNP and indel calling. ReduceReads enables the GATK to call tens of thousands of deeply sequenced NGS samples simultaneously.
HaplotypeCaller, a multi-sample local de novo assembly and integrated SNP, indel, and short SV caller.
Powerful extensions to the Unified Genotyper that support variant calling of pooled samples, mitochondrial DNA, and non-diploid organisms. The extended Unified Genotyper also introduces a novel error modeling approach that uses a reference sample to build a site-specific error model for SNPs and indels, which vastly improves calling accuracy.
Base Quality Score Recalibration
IMPORTANT: the Count Covariates and Table Recalibration tools (which comprise BQSRv1) have been retired! Please see the BaseRecalibrator tool (BQSRv2) for running recalibration with GATK 2.0.
Unified Genotyper
Handle exception generated when non-standard reference bases are present in the fasta.
Bug fix for indels: the check on the limits of a read to clip did not account for reads that had already been clipped.
Now emits the MLE AC and AF in the INFO field.
Don't allow N's in insertions when discovering indels.
Phase By Transmission
Multi-allelic sites are now correctly ignored.
Reporting of mendelian violations is enhanced.
Corrected TP overflow.
Fixed bug that arose when no PLs were present.
Added option to output the father's allele first in phased child haplotypes.
Fixed a bug that caused the wrong phasing of child/father pairs.
Variant Eval
Improvements to the validation report module: if both eval and comp have genotypes, the comp genotypes are now subset down to the samples being evaluated when determining TP, FP, FN, and TN status.
The AlleleCount stratification now uses the MLE AC by default when it is present (and otherwise falls back to the greedy AC).
Fixed bugs in the VariantType and IndelSize stratifications.
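The AlleleCount fallback described above can be sketched roughly as follows. This is a minimal illustration rather than the VariantEval implementation: the function name is hypothetical, and the INFO keys `MLEAC` and `AC` are assumed names for the MLE and greedy allele counts.

```python
def stratification_ac(info):
    """Pick the allele count to stratify on, preferring the MLE value.

    `info` is a dict of INFO fields for one record. The key names
    "MLEAC" and "AC" are assumptions made for this sketch.
    """
    mle_ac = info.get("MLEAC")
    if mle_ac is not None:
        # The MLE AC is present, so it is used by default.
        return mle_ac
    # Otherwise fall back to the greedy AC (None if that is missing too).
    return info.get("AC")


if __name__ == "__main__":
    print(stratification_ac({"MLEAC": 3, "AC": 4}))
    print(stratification_ac({"AC": 4}))
```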
Variant Annotator
FisherStrand annotation no longer hard-codes filters for bases/reads (it previously required MAPQ > 20 && QUAL > 20).
Miscellaneous bug fixes to experimental annotations.
Added a Clipping Rank Sum Test to detect when variants are present on reads with differential clipping.
Fixed the ReadPos Rank Sum Test annotation so that it no longer uses the un-hardclipped start as the alignment start.
Fixed bug in the NBaseCount annotation module.
The new TandemRepeatAnnotator is now a standard annotation while HRun has been retired.
Added PED support for the Inbreeding Coefficient annotation.
Don't compute QD if there is no QUAL.
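The QD guard in the last item can be pictured with the sketch below. It assumes QD is the variant QUAL divided by a depth value; the function name is hypothetical, and the way the annotation actually derives depth is not described in these notes.

```python
def quality_by_depth(qual, depth):
    """Return QUAL normalized by depth, or None when it cannot be computed.

    Assumption for this sketch: `qual` is None when the record has no
    QUAL, and `depth` is a read depth supplied by the caller.
    """
    if qual is None:
        # No QUAL on the record: skip the annotation entirely.
        return None
    if depth is None or depth <= 0:
        # Avoid dividing by zero when no depth is available.
        return None
    return qual / depth


if __name__ == "__main__":
    print(quality_by_depth(300.0, 20))
    print(quality_by_depth(None, 20))
```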
Variant Quality Score Recalibration
The VCF index is now created automatically for the recalFile.
Select Variants
Now allows you to run with type-unsafe JEXL selects, which all default to false when matching.
Added an option which allows the user to re-genotype through the exact AF calculation model (if PLs are present) in order to recalculate the QUAL and genotypes.
Combine Variants
Added --mergeInfoWithMaxAC argument to keep info fields from the input with the highest AC value.
Somatic Indel Detector
GT header line is now output.
Automatically skips Ion reads just like it does with 454 reads.
Variants To Table
Genotype-level fields can now be specified.
Added the --moltenize argument to produce molten output of the data.
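For readers unfamiliar with "molten" output, the sketch below shows the general idea: each wide row is unpivoted into one (record, variable, value) row per field. The function name and the exact column layout are assumptions for illustration and may not match what --moltenize emits.

```python
def moltenize(records):
    """Unpivot wide records into long (record_id, variable, value) rows.

    `records` is a list of dicts, one per table row. Record IDs are
    1-based row numbers, which is an assumption made for this sketch.
    """
    rows = []
    for record_id, record in enumerate(records, start=1):
        for variable, value in record.items():
            rows.append((record_id, variable, value))
    return rows


if __name__ == "__main__":
    wide = [{"CHROM": "1", "POS": 100}, {"CHROM": "2", "POS": 250}]
    for row in moltenize(wide):
        # Print each molten row as tab-separated text.
        print("\t".join(str(field) for field in row))
```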
Depth Of Coverage
Fixed a NullPointerException that could occur if the user requested an interval summary but never provided a -L argument.
Miscellaneous
BCF2 support in tools that output VCFs (use the .bcf extension).
The GATK Engine no longer automatically strips the suffix "Walker" after the end of tool names; as such, all tools whose name ended with "Walker" have been renamed without that suffix.
Fixed bug when specifying a JEXL expression for a field that doesn't exist: the whole expression is now treated as false (previously the JEXL exception was rethrown).
There is now a global --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends).
Removed all code associated with extended events.
Algorithmically faster version of DiffEngine.
Improved down-sampling fixes edge cases that used to be handled poorly. Read Walkers can now use down-sampling.
GQ is now emitted as an int, not a float.
Fixed bug in the Beagle codec that was skipping the first line of the file when decoding.
Fixed bug in the VCF writer in the case where there are no genotypes for a record but there are genotypes in the header.
Miscellaneous fixes to the VCF headers being produced.
Fixed up the BadCigar read filter.
Removed the old deprecated genotyping framework revolving around the misordering of alleles.
Extensive refactoring of the GATKReports.
Picard jar updated to version 1.67.1197.
Tribble jar updated to version 110.
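As an illustration of the global --interval_padding argument listed above, the sketch below extends each interval by a fixed number of basepairs on both ends. It is not the engine's code: the function name, the 1-based inclusive coordinates, and the clamping to contig bounds are assumptions, and any merging of overlapping padded intervals is left out.

```python
def pad_intervals(intervals, padding, contig_lengths=None):
    """Extend each (contig, start, end) interval by `padding` bp per side.

    Coordinates are assumed to be 1-based and inclusive. If
    `contig_lengths` is given, padded ends are clamped to the contig
    length; starts are always clamped to 1.
    """
    padded = []
    for contig, start, end in intervals:
        new_start = max(1, start - padding)
        new_end = end + padding
        if contig_lengths and contig in contig_lengths:
            new_end = min(new_end, contig_lengths[contig])
        padded.append((contig, new_start, new_end))
    return padded


if __name__ == "__main__":
    print(pad_intervals([("chr1", 1000, 2000)], 100))
```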
Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT