HC step 1: Defining ActiveRegions by measuring data entropy
This document describes the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.
To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints.
1. Calculating the raw activity profile
Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus.
In the normal case (no special mode is enabled) the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment.
If using the mode for genotyping given alleles (GGA) or the advanced-level flag
-useAlleleTrigger, and the site is overlapped by an allele in the VCF file provided through the
-alleles argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0.
This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the
2. Smoothing the activity profile
The final profile is calculated by smoothing this initial raw profile following three steps. The first two steps consist in spreading individual position raw profile values to contiguous bases. As a result each position will have more than one raw profile value that are added up in the third and last step to obtain a final unique and smoothed value per position.
Unless one of the special modes is enabled (GGA or allele triggering), the position profile value will be copied over to adjacent regions if enough high quality soft-clipped bases immediately precede or follow that position in the original alignment. At time of writing, high-quality soft-clipped bases are those with quality score of Q29 or more. We consider that there are enough of such a soft-clips when the average number of high quality bases per soft-clip is 7 or more. In this case the site profile value is copied to all bases within a radius of that position as large as the average soft-clip length without exceeding a maximum of 50bp.
Each profile value is then divided and spread out using a Gaussian kernel covering up to 50bp radius centered at its current position with a standard deviation, or sigma, set using the
-bandPassSigmaargument (current default is 17 bp). The larger the sigma, the broader the spread will be.
For each position, the final smoothed value is calculated as the sum of all its profile values after steps 1 and 2.
3. Setting the ActiveRegion thresholds and intervals
The resulting profile line is cut in regions where it crosses the non-active to active threshold (currently set to 0.002). Then we make some adjustments to these boundaries so that those regions that are to be considered active, with a profile running over that threshold, fall within the minimum (fixed to 50bp) and maximum region size (customizable using
If the region size falls within the limits we leave it untouched (it's good to go).
If the region size is shorter than the minimum, it is greedily extended forward ignoring that cut point and we come back to step 1. Only if this is not possible because we hit a hard-limit (end of the chromosome or requested analysis interval) we will accept the small region as it is.
If it is too long, we find the lowest local minimum between the maximum and minimum region size. A local minimum is a profile value preceded by a large one right up-stream (-1bp) and an equal or larger value down-stream (+1bp). In case of a tie, the one further downstream takes precedence. If there is no local minimum we simply force the cut so that the region has the maximum active region size.
Of the resulting regions, those with a profile that runs over this threshold are considered active regions and progress to variant discovery and or calling whereas regions whose profile runs under the threshold are considered inactive regions and are discarded except if we are running HC in reference confidence mode.
There is a final post-processing step to clean up and trim the ActiveRegion:
Remove bases at each end of the read (hard-clipping) until there a base with a call quality equal or greater than minimum base quality score (customizable parameter
-mbq, 10 by default).
Include or exclude remaining soft-clipped ends. Soft clipped ends will be used for assembly and calling unless the user has requested their exclusion (using
-dontUseSoftClippedBases), if the read and its mate map to the same chromosome, and if they are in the correct standard orientation (i.e. LR and RL).
Clip off adaptor sequences of the read if present.
Discard all reads that no longer overlap with the ActiveRegion after the trimming operations described above.
Downsample remaining reads to a maximum of 1000 reads per sample, but respecting a minimum of 5 reads starting per position. This is performed after any downsampling by the traversal itself (
-dcovetc.) and cannot be overriden from the command line.