Release notes for GATK version 3.5
GATK 3.5 was released on November 25, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.
- MuTect2: somatic SNP and indel caller based on HaplotypeCaller and the original MuTect.
- ContEst: estimation of cross-sample contamination (primarily for use in somatic variant discovery).
- GatherBqsrReports: utility to gather recalibration tables from scatter-parallelized BaseRecalibrator runs.
Variant Context Annotations
Added allele-specific versions of existing annotations: AS_BaseQualityRankSumTest, AS_FisherStrand, AS_MappingQualityRankSumTest, AS_RMSMappingQuality, AS_RankSumTest, AS_ReadPosRankSumTest, AS_StrandOddsRatio, AS_QualByDepth and AS_InbreedingCoeff.
Added BaseCountsBySample annotation. Intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup observed in the original BAM file because of the local realignment and additional filtering performed internally by HaplotypeCaller. It can only be requested from HaplotypeCaller, not from VariantAnnotator.
Added ExcessHet annotation. Estimates excess heterozygosity in a population of samples. Related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.
Added FractionInformativeReads annotation. Reports the fraction of reads that were considered informative by HaplotypeCaller (over all samples).
Enforced calculating GenotypeAnnotations before InfoFieldAnnotations. This ensures that the AD value is available to use in the QD calculation.
Reorganized standard annotation groups processing to ensure that all default annotations always get annotated regardless of what is specified on the command line. This fixes a bug where default annotations were getting dropped when the command line included annotation requests.
Made GenotypeGVCFs subset StrandAlleleCounts intelligently, i.e. subset the SAC values to the called alleles. Previously, when the StrandAlleleCountsBySample (SAC) annotation was present in GVCFs, GenotypeGVCFs carried it over to the final VCF essentially unchanged. This was problematic because SAC includes the counts for all alleles originally present (including NON-REF) even when some are not called in the final VCF. When the full list of original alleles is no longer available, parsing SAC could become difficult if not impossible.
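The subsetting described above can be sketched as follows. This is a minimal illustration, not the GATK implementation: it assumes SAC stores one forward-strand and one reverse-strand count per allele, in allele order, and the function name `subset_sac` is hypothetical.

```python
def subset_sac(sac, original_alleles, kept_alleles):
    """Keep only the (forward, reverse) count pairs for the alleles that remain.

    Assumes `sac` holds two counts per allele, ordered like `original_alleles`.
    """
    if len(sac) != 2 * len(original_alleles):
        raise ValueError("SAC must hold two counts per original allele")
    subset = []
    for allele in kept_alleles:
        i = original_alleles.index(allele)
        subset.extend(sac[2 * i : 2 * i + 2])
    return subset


# GVCF record with REF=A, ALT=C,G,<NON_REF>; only A and C are called in the end.
original = ["A", "C", "G", "<NON_REF>"]
sac = [10, 12, 7, 9, 0, 1, 0, 0]
print(subset_sac(sac, original, ["A", "C"]))  # [10, 12, 7, 9]
```

Without this step, the eight original values would sit next to a two-allele record, and a downstream parser would have no way to tell which pairs belong to which allele.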
Added new MQ jittering functionality to improve how VQSR handles MQ. Note that HaplotypeCaller now calculates a new annotation called RAW_MQ per-sample, which is then integrated per-cohort by GenotypeGVCFs to produce the MQ annotation.
VariantAnnotator can now annotate the FILTER field from an external resource. Usage:
--resource:foo resource.vcf --expression foo.FILTER
VariantAnnotator can now check allele concordance when annotating with an external resource. Usage:
Bug fix: The annotation framework was improved to allow the collection of sufficient statistics during GVCF creation, which are then used to compute the final annotation during genotyping. This avoids using the median of the per-sample values as the representative annotation. TL;DR: annotations will be more accurate when using the GVCF workflow for joint discovery.
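To illustrate why sufficient statistics matter, here is a minimal sketch using RMS mapping quality as the example. It assumes each sample contributes a sum of squared mapping qualities plus a read count; the function names and data are hypothetical and do not reflect GATK's internal representation.

```python
import math
from statistics import median


def raw_stats(mapping_quals):
    """Per-sample sufficient statistics: (sum of squared MQs, read count)."""
    return (sum(q * q for q in mapping_quals), len(mapping_quals))


def combined_rms(stats):
    """Cohort-level RMS computed from the pooled sufficient statistics."""
    total_sq = sum(sq for sq, _ in stats)
    total_n = sum(n for _, n in stats)
    return math.sqrt(total_sq / total_n)


def median_of_rms(stats):
    """Older-style summary: median of the per-sample RMS values."""
    return median([math.sqrt(sq / n) for sq, n in stats])


# Three samples with very different read depth at the same site.
samples = [[60, 60, 60, 60], [60, 20], [30]]
stats = [raw_stats(s) for s in samples]

print(combined_rms(stats))   # weights every read equally
print(median_of_rms(stats))  # ignores how many reads each sample has
```

The pooled value equals the RMS over all reads in the cohort, whereas the median treats a one-read sample the same as a deeply covered one.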
Variant manipulation tools
Allowed overriding hard-coded cutoff for allele length in ValidateVariants and in LeftAlignAndTrimVariants. Usage:
--reference_window_stop N where N is the desired cutoff.
Also in LeftAlignAndTrimVariants, trimming multiallelic alleles is now the default behavior.
Fixed the ability to mask out SNPs in FastaAlternateReferenceMaker.
Also in FastaAlternateReferenceMaker, fixed merging of contiguous intervals properly, and made the tool produce more informative contig names.
Fixed a bug in CombineVariants that occurred when one record has a spanning deletion and needs a padded reference allele.
Added a new VariantEval evaluation module, MetricsCollection, that summarizes metrics from several EV modules.
Enabled family-level stratification in MendelianViolationEvaluator of VariantEval (if a ped file is provided), making it possible to count Mendelian violations for each family in a callset with multiple families.
Added the ability for SelectVariants to enforce output that complies with version 4.2 of the VCF spec when processing older files. Use case: the 4.2 spec specifies that GQ must be an integer; by default we don’t enforce it (so if reading an older file that used decimals, we don’t change it) but the new argument --forceValidOutput converts the values on request. Not made default because of some performance slowdown, so writing VCFs is now fast by default, compliant by choice.
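As a rough sketch of the kind of conversion involved, the snippet below rewrites a decimal GQ in a single sample column as an integer. It is illustrative only: the function name is hypothetical, and rounding to the nearest integer is an assumption rather than a documented detail of SelectVariants.

```python
def force_integer_gq(format_field, sample_field):
    """Return the sample column with GQ rewritten as an integer, if present.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "GQ" in keys:
        i = keys.index("GQ")
        # Trailing fields may be dropped in VCF, and "." means missing.
        if i < len(values) and values[i] != ".":
            values[i] = str(round(float(values[i])))
    return ":".join(values)


print(force_integer_gq("GT:AD:GQ", "0/1:10,5:45.32"))  # 0/1:10,5:45
```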
Improved VCF sequence dictionary validation. Note that as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime, -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.
Various improvements to the tools’ performance, especially HaplotypeCaller, by making the code more efficient and cutting out crud.
GenotypeGVCFs now emits a no-call (./.) when the evidence is too ambiguous to make a call at all (e.g. all the PLs are zero). Previously this would have led to a hom-ref call with RGQ=0.
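The new behavior can be pictured with a small sketch. It is a simplification that covers only the all-PLs-zero case mentioned above; the function name and the diploid, biallelic genotype ordering are assumptions for illustration, not the GenotypeGVCFs code.

```python
def call_genotype(pls, genotypes=("0/0", "0/1", "1/1")):
    """Pick the genotype with the lowest PL, or no-call if all PLs are zero."""
    if len(pls) != len(genotypes):
        raise ValueError("expected one PL per genotype")
    if all(pl == 0 for pl in pls):
        # No genotype is favored over any other, so emit a no-call
        # instead of defaulting to hom-ref.
        return "./."
    return genotypes[pls.index(min(pls))]


print(call_genotype([0, 0, 0]))     # ./.
print(call_genotype([0, 30, 200]))  # 0/0
print(call_genotype([45, 0, 60]))   # 0/1
```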
Fixed a bug in GenotypeGVCFs that sometimes generated invalid VCFs for haploid callsets. The tool was carrying over the AD from alleles that had been trimmed out, causing field length mismatches.
Changed the genotyping implementation for haploid organisms to address performance problems reported when running GenotypeGVCFs on haploid callsets. Note that this change may lead to a slight loss of sensitivity at low-coverage sites -- let us know if you observe anything dramatic.
Genotyping engine tweaks
Ensured inputPriors get used if they are specified to the genotyper (previously they were ignored). Also improved docs on
Fixed a bug that affected the --ignoreInputSamples behavior of CalculateGenotypePosteriors.
Limited emission of the scary warning message about max number of alleles (“this tool is set to genotype at most x alleles but we found more; only x will be used”) to a single occurrence unless DEBUG logging mode is activated. Otherwise it fills up our output logs.
Miscellaneous tool fixes
Added option to OverclippedReadFilter to not require soft-clips on both ends. Contributed by Jacob Silterra.
Fixed a bug in IndelRealigner where the tool was incorrectly "fixing" mates when supplementary alignments are present. The patch involves ignoring supplementary alignments.
Fixed a bug in CatVariants. Previously, VCF files were being sorted solely on the base pair position of the first record, ignoring the chromosome. This can become problematic when merging files from different chromosomes, especially if you have multiple VCFs per chromosome. Contributed by John Wallace.
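A minimal sketch of the ordering issue, assuming each input file is summarized by the contig and position of its first record. The tuple layout, file names and function names are hypothetical and chosen only to show the difference between the two sort keys.

```python
def sort_by_position_only(shards):
    """Previous behavior: order files by first-record position alone."""
    return sorted(shards, key=lambda s: s[2])


def sort_by_contig_then_position(shards, contig_order):
    """Fixed behavior: order by contig rank in the reference, then position."""
    rank = {contig: i for i, contig in enumerate(contig_order)}
    return sorted(shards, key=lambda s: (rank[s[1]], s[2]))


# (file name, contig of first record, position of first record)
shards = [
    ("chr2_part1.vcf", "chr2", 100),
    ("chr1_part2.vcf", "chr1", 5000),
    ("chr1_part1.vcf", "chr1", 100),
]

print([s[0] for s in sort_by_position_only(shards)])
print([s[0] for s in sort_by_contig_then_position(shards, ["chr1", "chr2"])])
```

With position alone, the chr2 file lands ahead of `chr1_part2.vcf`, which interleaves the chromosomes in the concatenated output.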
Engine-level behaviors and capabilities
Added support for reading and writing CRAM files. Some improvements are still expected in htsjdk. Contributed by Vadim Zalunin at EBI and collaborators at the Sanger Institute.
Made interval-list output format dependent on the file extension (for RealignerTargetCreator). If the extension is .interval_list, output will be formatted as a proper Picard interval list (with sequence dictionary). Otherwise it will be a basic GATK interval list as previously.
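The extension-based switch can be sketched as below. This is a simplified illustration, not the RealignerTargetCreator code: the function name is hypothetical, the header is reduced to a bare `@HD` line plus `@SQ` lines, and the strand and name columns are filled with placeholder values.

```python
def format_intervals(path, intervals, sequence_dict):
    """Render intervals as Picard-style or GATK-style text based on `path`.

    intervals:     list of (contig, start, end) tuples, 1-based inclusive
    sequence_dict: list of (contig name, contig length) tuples
    """
    if path.endswith(".interval_list"):
        # Picard style: SAM-like header carrying the sequence dictionary,
        # then tab-separated contig, start, end, strand, name.
        lines = ["@HD\tVN:1.5"]
        lines += [f"@SQ\tSN:{name}\tLN:{length}" for name, length in sequence_dict]
        lines += [f"{c}\t{s}\t{e}\t+\t." for c, s, e in intervals]
    else:
        # Basic GATK style: one contig:start-end string per line.
        lines = [f"{c}:{s}-{e}" for c, s, e in intervals]
    return "\n".join(lines) + "\n"


targets = [("chr1", 100, 200)]
ref_dict = [("chr1", 1000)]
print(format_intervals("targets.interval_list", targets, ref_dict))
print(format_intervals("targets.intervals", targets, ref_dict))
```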
Added static binning capability for base recalibration (BQSR).
Added a new JobRunner called ParallelShell that will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag maximumNumberOfJobsToRunConcurrently. Contributed by Johan Dahlberg.
Updated the extension for Picard CalculateHsMetrics to include the PER_TARGET_COVERAGE argument and added an extension for Picard CollectWgsMetrics.
- BeagleOutputToVCF, VariantsToBeagleUnphased, ProduceBeagleInput. These are tools for handling Beagle data. The latest versions of Beagle support VCF input and output, so there is no longer any reason for us to provide converters.
- ReadAdaptorTrimmer and VariantValidationAssessor. These were experimental tools which we think are not useful and not operating on a sufficiently sound basis.
- BaseCoverageDistribution and CoveredByNSamplesSites. These tools were redundant with DiagnoseTargets and/or DepthOfCoverage.
- LiftOverVariants, FilterLiftedVariants and liftOverVCF.pl. The Picard liftover tool LiftoverVCF works better and is easier to operate.
- sortByRef.pl. Use Picard SortVCF instead.
- ListAnnotations. This was intended as a utility for listing annotations easily from command line, but it has not proved useful.
- Made various documentation improvements.
- Updated date and street address in license text.
- Moved htsjdk and Picard to version 1.141.