We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
(How to) Call common and rare germline copy number variants
Document is in
BETA. It may be incomplete and/or inaccurate. Post suggestions and read about updates in the Comments section.
The tutorial outlines steps in detecting germline copy number variants (gCNVs) and illustrates two workflow modes--cohort mode and case mode. The cohort mode simultaneously generates a cohort model and calls CNVs for the cohort samples. The case mode analyzes a single sample against an already constructed cohort model. The same workflow steps apply to both targeted exome and whole genome sequencing (WGS) data. The workflow is able to call both rare and common events and intelligently handles allosomal ploidies, i.e. cohorts of mixed male and female samples.
For the cohort mode, the general recommendation is at least a hundred samples to start. Researchers should expect to tune workflow parameters from the provided defaults. In particular, GermlineCNVCaller's default inference parameters are conservatively set for efficient run times.
The figure diagrams the workflow tools. Section 1 creates an intervals list and counts read alignments overlapping the intervals. Section 2 shows optional but recommended cohort mode steps to annotate intervals with covariates for use in filtering intervals as well as for use in explicit modeling. The section also removes outlier counts intervals. Section 3 generates global baseline observations for the data and models and calls the ploidy of each contig. Section 4 is at the heart of the workflow and models per-interval copy number. Because of the high intensity of compute model fitting requires, the section shows how to analyze data in parts. Finally, Section 5 calls per-sample copy number events per interval and per segment. Results are in VCF format.
► A highly recommended whitepaper detailing the methods is in the gatk GitHub repository's docs/CNV directory.
► For pipelined workflows, see the gatk GitHub repository's scripts/cnv_wdl directory. Be sure to obtain a tagged version of the script, e.g. v220.127.116.11, following instructions in Section 4 of Article#23405.
► This workflow is not appropriate for bulk tumor samples, as it infers absolute copy numbers. For somatic copy number alteration calling, see Tutorial#11682.
Article#11687 visualizes the results in IGV and provides followup discussion. Towards data exploration, here are two illustrative Jupyter Notebook reports that dissect the results.
- Notebook#11685 shows an approach to measuring concordance of sample NA19017 gCNV calls to 1000 Genomes Project truth set calls using tutorial
- Notebook#11686 examines gCNV callset annotations using larger data, namely chr20 gCNV results from the tutorial's 24-sample cohort.
Jump to a section
- (Optional) Annotate intervals with features and subset regions of interest with FilterIntervals
- Call autosomal and allosomal contig ploidy with DetermineGermlineContigPloidy
- GATK 18.104.22.168
Workflow tools DetermineGermlineContigPloidy, GermlineCNVCaller and PostprocessGermlineCNVCalls require a Python environment with specific packages, e.g. the gCNV computational python module gcnvkernel. See Article#12836 for instructions on setting up and managing the environment with the user-friendly conda. Once the conda environment is set up, e.g. with
conda env create -f gatkcondaenv.yml, activate it with
source activate gatkor
conda activate gatkbefore running the tool.
Alternatively, use the broadinstitute/gatk Docker, which activates the Python environment by default. Allocation of at least 8GB memory to Docker is recommended for the tutorial commands. See Article#11090 for instructions to launch a Docker container.
Download example data
The tutorial provides example small WGS data sourced from the 1000 Genomes Project. Cohort mode illustrations use 24 samples, while case mode illustrations analyze one sample against a cohort model made from the remaining 23 samples. The tutorial uses a fraction of the workflow's recommended hundred samples for ease of illustration. Furthermore, commands in each step use one of three differently sized intervals lists for efficiency. Coverage data are from the entirety of chr20, chrX and chrY. So although a step may analyze a subset of regions, it is possible to instead analyze all three contigs in case or cohort modes.
Download tutorial_11684.tar.gz either from the GoogleDrive or from the FTP site. The bundle includes data for Notebook#11685 and Notebook#11686. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. The example data is from the 1000 Genomes project Phase 3 aligned to GRCh38.
1. Collect raw counts data with PreprocessIntervals and CollectReadCounts
PreprocessIntervals pads exome targets and bins WGS intervals. Binning refers to creating equally sized intervals across the reference. For example, 1000 base binning would define chr1:1-1000 as the first bin. Because counts of reads on reference
N bases are not meaningful, the tool automatically excludes bins with all
Ns. For GRCh38 chr1, non-N sequences start at base 10,001, so the first few bin become:
For WGS data, bin entirety of reference, e.g. with 1000 base intervals.
gatk PreprocessIntervals \ -R ~/ref/Homo_sapiens_assembly38.fasta \ --padding 0 \ -imr OVERLAPPING_ONLY \ -O grch38.preprocessed.interval_list
This produces a Picard-style intervals list of 1000 base bins.
For exome data, pad target regions, e.g. with 250 bases.
gatk PreprocessIntervals \ -R ~/ref/Homo_sapiens_assembly38.fasta \ -L targets.interval_list \ --bin-length 0 \ -imr OVERLAPPING_ONLY \ -O targets.preprocessed.interval_list
This produces a Picard-style intervals list of exome target regions padded by 250 bases on either side.
For the tutorial, bins three contigs.
The contigs in
gcnv-chr20XY-contig.list subset the reference to chr20, chrX and chrY.
gatk PreprocessIntervals \ -R ref/Homo_sapiens_assembly38.fasta \ --padding 0 \ -L gcnv-chr20XY-contig.list \ -imr OVERLAPPING_ONLY \ -O chr20XY.interval_list
This generates a Picard-style intervals list with 242,549 intervals. The file has a header section with
@ header lines and a five-column body. See Article#11009 for a description of the columns.
Comments on select parameters
- For WGS, the default 1000
--bin-lengthis the recommended starting point for typical 30x data. Be sure to set
--padding 0to disable padding outside of given genomic regions. Bin size should correlate with depth of coverage, e.g. lower coverage data should use larger bin size while higher coverage data can support smaller bin size. The size of the bin defines the resolution of CNV calls. The factors to consider in sizing include how noisy the data is, average coverage depth and how even coverage is across the reference.
- For targeted exomes, provide the exome capture kit's target intervals with
--bin-length 0to disable binning and pad the intervals with
--padding 250or other desired length.
- Provide intervals to exclude from analysis with
-XL, e.g. centromeric regions. Consider using this option especially if data is aligned to a reference other than GRCh38. The workflow enables excluding regions later again using
-XL. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the FilterIntervals step (section 2), the DetermineGermlineContigPloidy step (section 3), at the GermlineCNVCaller step (section 5) and/or post-calling.
CollectReadCounts tabulates the raw integer counts of reads overlapping an interval. The tutorial has already collected read counts ahead of time for the three contigs--chr20, chrX and chrY. Here, we collect read counts on small data.
Count reads per bin using CollectReadCounts
gatk CollectReadCounts \ -L chr20sub.interval_list \ -R ref/Homo_sapiens_assembly38.fasta \ -imr OVERLAPPING_ONLY \ -I NA19017.chr20sub.bam \ --format TSV \ -O NA19017.tsv
This generates a TSV format table of read counts.
Comments on select parameters
- The tutorial generates text-based TSV (tab-separated-value) format data instead of the default HDF5 format by adding
--format TSVto the command. Omit this option to generate the default HDF5 format. Downstream tools process HDF5 format more efficiently.
- Here and elsewhere in the workflow, set
OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals.
- The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter and MappingQualityReadFilter. This means the tool excludes reads marked as duplicate and excludes reads with mapping quality less than 10. Change the mapping quality threshold with the
After the SAM format header section, denoted by lines starting with
@, the body of the data has a column header line followed by read counts for every interval.
☞ 1.1 How do I view HDF5 format data?
See Article#11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in another tutorial.
2. (Optional) Annotate intervals with features and subset regions of interest with FilterIntervals
The steps in this section pertain to the cohort mode.
Researchers may desire to subset the intervals that GermlineCNVCaller will analyze, either to exclude potentially problematic regions or to retain only regions of interest. For example one may wish to exclude regions where all samples in a large cohort have copy number zero. Filtering intervals can be especially impactful for analyses that utilize references other than GRCh38 or that are based on sequencing technologies affected by sequence context, e.g. targeted exomes. The tutorial data is WGS data aligned to GRCh38, and the gCNV workflow can process the entirety of the data, without the need for any interval filtering.
Towards deciding which regions to exclude, AnnotateIntervals labels the given intervals with GC content and additionally with mappability and segmental duplication content if given the respective optional resource files. FilterIntervals then subsets the intervals list based on the annotations and other tunable thresholds. Later, GermlineCNVCaller also takes in the annotated intervals to use as covariates towards analysis.
Explicit GC-correction, although optional, is recommended. The default v22.214.171.124
cnv_germline_cohort_workflow.wdl pipeline workflow omits explicit gc-correction and we activate it in the pipeline by setting
do_explicit_gc_correction":"True". The tutorial illustrates the optional AnnotateIntervals step by performing the recommended explicit GC-content-based filtering.
AnnotateIntervals with GC content
gatk AnnotateIntervals \ -L chr20XY.interval_list \ -R ref/Homo_sapiens_assembly38.fasta \ -imr OVERLAPPING_ONLY \ -O chr20XY.annotated.tsv
This produces a four-column table where the fourth column gives the fraction of GC content.
Comments on select parameters
- The tool requires the
-Rreference and the
-Lintervals. The tool calculates GC-content for the intervals using the reference.
Although optional for the tool, we recommend annotating mappability by providing a
--mappability-trackregions file in either
.bed.gzformat. Be sure to merge any overlapping intervals beforehand. The tutorial omits use of this resource.
GATK recommends use of the the single-read mappability track, as the multi-read track requires much longer times to process. For example, the Hoffman lab at the University of Toronto provides human and mouse mappability BED files for various kmer lengths at https://bismap.hoffmanlab.org/. The accompanying publication is titled Umap and Bismap: quantifying genome and methylome mappability.
Optionally and additionally, annotate segmental duplication content by providing a
--segmental-duplication-trackregions file in either
- Exclude undesirable intervals with the
-XLparameter, e.g. intervals corresponding to centromeric regions.
FilterIntervals takes preprocessed intervals and either annotated intervals or read counts or both. It can also exclude intervals given with
-XL. When given both types of data, the tool retains the intervals that intersect from filtering on each data type. The v126.96.36.199
cnv_germline_cohort_workflow.wdl pipeline script requires read counts files, and so by default the pipeline script always performs the FilterIntervals step on read counts.
FilterIntervals based on GC-content and cohort extreme counts
gatk FilterIntervals \ -L chr20XY.interval_list \ --annotated-intervals chr20XY.annotated.tsv \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ -imr OVERLAPPING_ONLY \ -O chr20XY.cohort.gc.filtered.interval_list
This produces a Picard-style intervals list containing a subset of the starting intervals (230,126 of 242,549). Of the filtered intervals, GC-content filters 42 intervals, and extreme coverage counts from the 24-sample cohort remove an additional 12,381 intervals for a total of 12,423 removed filtered intervals (5.122% of starting).
Comments on select parameters
- The tool requires the preprocessed intervals, provided with
-L, from Section 1.
Given annotated intervals with
--annotated-intervals, the tool filters intervals on the given annotation(s).
- GC-content thresholds are set by
--maximum-gc-content, where defaults are 0.1 and 0.9, respectively.
- Mappability thresholds are set by
--maximum-mappability. Defaults are 0.9 and 1.0, respectively.
- Segmental duplication content thresholds are set by
--maximum-segmental-duplication-content. Defaults are 0.0 and 0.5, respectively.
- GC-content thresholds are set by
Given read counts files, each with
-Iand in either HDF5 or TSV format, the tool filters intervals on low and extreme read counts with the following tunable thresholds.
--low-count-filter-count-thresholddefault is 5
--low-count-filter-percentage-of-samplesdefault is 90.0
--extreme-count-filter-minimum-percentiledefault is 1.0
--extreme-count-filter-maximum-percentiledefault is 99.0
--extreme-count-filter-percentage-of-samplesdefault is 90.0
The read counts data must match each other in intervals.
For the default parameters, the tool first filters intervals with a count less than 5 in greater than 90% of the samples. The tool then filters the remaining intervals with a count percentile less than 1 or greater than 99 in a percentage of samples greater than 90%. These parameters effectively exclude intervals where all samples have extreme outlier counts, e.g. are deleted.
To disable counts based filtering, omit the read counts or, e.g. when using the v188.8.131.52
cnv_germline_cohort_workflow.wdlpipeline script, set the two
percentage-of-samplesparameters as follows.
--low-count-filter-percentage-of-samples 100 \ --extreme-count-filter-percentage-of-samples 100 \
Provide intervals to exclude from analysis with
-XL, e.g. centromeric regions. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the FilterIntervals step (section 2), the DetermineGermlineContigPloidy step (section 3), at the GermlineCNVCaller step (section 5) and/or post-calling.
3. Call autosomal and allosomal contig ploidy with DetermineGermlineContigPloidy
DetermineGermlineContigPloidy calls contig level ploidies for both autosomal, e.g. human chr20, and allosomal contigs, e.g. human chrX. The tool determines baseline contig ploidies using sample coverages and contig ploidy priors that give the prior probabilities for each ploidy state for each contig. In this process, the tool generates global baseline coverage and noise data GermlineCNVCaller will use in section 5.
The tool determines baseline contig ploidies using the total read count per contig. Researchers should consider the impact of this for their data. For example, for the tutorial WGS data, the contribution of the PAR regions to total coverage counts on chrX is small and the tool correctly calls allosomal ploidies. However, consider blacklisting PAR regions for data where the contribution is disporportionate, e.g. targeted panels.
The cohort mode requires a
--contig-ploidy-priors table and produces a ploidy model.
gatk DetermineGermlineContigPloidy \ -L chr20XY.cohort.gc.filtered.interval_list \ --interval-merging-rule OVERLAPPING_ONLY \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ --contig-ploidy-priors chr20XY_contig_ploidy_priors.tsv \ --output . \ --output-prefix ploidy \ --verbosity DEBUG
This produces a
ploidy-calls directory and a
ploidy-model directory. The
ploidy-calls directory contains a folder of data for each sample in the cohort including the contig ploidy calls. Each sample directory, e.g.
ploidy-calls/SAMPLE_0, contains five files.
contig_ploidy.tsvnotes the ploidy and genotype quality (GQ) of the ploidy call for each contig.
global_read_depth.tsvnotes an average depth value and an average ploidy across all the intervals of the sample.
mu_psi_s_log__.tsvcaptures the posterior mean for all of the modeled parameters.
sample_name.txtcontains the readgroup sample (RG SM) name.
std_psi_s_log__.tsvcaptures the standard deviation for all of the modeled paramters.
ploidy-model directory contains aggregated model data for the cohort. This is the model to provide to a case-mode DetermineGermlineContigPloidy analysis and to GermlineCNVCaller. The tutorial
ploidy-model directory contains the eight files as follows.
contig_ploidy_prior.tsvis a copy of the ploidy priors given to the tool.
gcnvkernel_version.jsonnotes the version of the kernel.
interval_list.tsvrecapitulates the intervals used, e.g. the filtered intervals.
The theano model automatically generates
std_ files and may append transformations it performs to the file name, e.g.
lowerbound as we see above. These are likely of interest only to advanced users.
The case mode calls contig ploidies for each sample against the ploidy model given by
--model. The following command runs sample NA19017 against a 23-sample cohort model.
gatk DetermineGermlineContigPloidy \ --model cohort-23wgs-20190213-contig-ploidy-model \ -I cvg/NA19017.tsv \ -O . \ --output-prefix ploidy-case \ --verbosity DEBUG
This produces a
ploidy-case-calls directory, which in turn contains a directory of sample data,
SAMPLE_0. A list of the five resulting files is some paragraphs above.
Comments on select parameters
- It is possible to analyze multiple samples simultaneously in a case mode command. Provide each sample with
- For the
-Lintervals, supply the most processed intervals list. For the tutorial, this is the filtered intervals. Note the case mode does not require explicit intervals because the ploidy model provides them.
- Provide a
--contig-ploidy-priorstable containing the per-contig prior probabilities for integer ploidy state. Again, the case mode does not require an explicit priors file as the ploidy model provides them. Tool documentation describes this resource in detail. The tutorial uses the following contig ploidy priors.
Optionally provide intervals to exclude from analysis with
-XL, e.g. pseudoautosomal (PAR) regions, which can skew results for certain data.
The results for NA19017, from either the cohort mode or the case mode, show ploidy 2 for chr20 and chrX and ploidy 0 for chrY. The PLOIDY_GQ quality metrics differ slightly for the modes. The entirety of NA19017's
contig_ploidy.tsv is shown.
Checking the ploidy calls for each of the 24 samples against metadata confirms expectations. The following table summarizes results for the 24 samples. The data was collated from DetermineGermlineContigPloidy results using a bashscript.
It should be noted, the tutorial's default parameter run gives XY samples CN1 for the majority of chrX, including for PAR regions, where coverage is actually on par with the CN2 of XX samples. See Article#11687 for further discussion.
4. Call copy number variants with GermlineCNVCaller
GermlineCNVCaller learns a denoising model per scattered shard while consistently calling CNVs across the shards. The tool models systematic biases and CNVs simultaneously, which allows for sensitive detection of both rare and common CNVs. For a description of innovations, see Blog#23439.
As the tool documentation states under Important Remarks (v184.108.40.206), the tool should see data from a large enough genomic region so as to be exposed to diverse genomic features. The current recommendation is to provide at least ~10–50Mbp genomic coverage per scatter. This applies to exomes or WGS. This allows reliable inference of bias factors including GC bias. The limitation of analyzing larger regions is available memory. As an analysis covers more data, memory requirements increase.
For expediency, the tutorial commands below analyze small data, specifically the 1400 bins in
twelveregions.cohort.gc.filtered.interval_list and use default parameters. The tutorial splits the 1400 bins into two shards with 700 bins each to illustrate scattering. This results in ~0.7Mbp genomic coverage per shard. See section 4.2 for how to split interval lists by a given number of intervals. Default inference parameters are conservatively set for efficient run times.
The tutorial coverage data are sufficient to analyze the ~15Mb in
chr20sub.cohort.gc.filtered.interval_listas well as the entirety of chr20, chrX and chrY using the ~230Mb of
chr20XY.cohort.gc.filtered.interval_list. The former, at 5K bins per shard, give three shards. When running the default parameters in a GATKv220.127.116.11 Docker locally on a MacBook Pro, each cohort-mode shard analysis takes ~20 minutes. The latter gives 46 shards at 5K bins per shard. When running the default parameters of the v18.104.22.168 WDL cohort-mode workflow on the cloud, the majority of the shard analyses complete in half an hour.
Call gCNVs on the 24-sample cohort in two scatters. Notice the different
-L intervals and
gatk GermlineCNVCaller \ --run-mode COHORT \ -L scatter-sm/twelve_1of2.interval_list \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ --contig-ploidy-calls ploidy-calls \ --annotated-intervals twelveregions.annotated.tsv \ --interval-merging-rule OVERLAPPING_ONLY \ --output cohort24-twelve \ --output-prefix cohort24-twelve_1of2 \ --verbosity DEBUG
gatk GermlineCNVCaller \ --run-mode COHORT \ -L scatter-sm/twelve_2of2.interval_list \ -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \ -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \ -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \ -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \ -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \ -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \ --contig-ploidy-calls ploidy-calls \ --annotated-intervals twelveregions.annotated.tsv \ --interval-merging-rule OVERLAPPING_ONLY \ --output cohort24-twelve \ --output-prefix cohort24-twelve_2of2 \ --verbosity DEBUG
This produces per-interval gCNV calls for each of the cohort samples and a gCNV model for the cohort. Each command produces three directories within
cohort24-twelve_1of2-calls folder of per sample gCNV call results, a
cohort24-twelve_1of2-model folder of cohort model data and a
cohort24-twelve_1of2-tracking folder of data that tracks model fitting. The table below lists the cohort mode data files alongside case mode files.
Call gCNVs on a sample against a cohort model. The case analysis must use the same scatter approach as the model generation. So, as above, we run two shard analyses. Here,
--output-prefix differ between the scatter the commands.
gatk GermlineCNVCaller \ --run-mode CASE \ -I cvg/NA19017.tsv \ --contig-ploidy-calls ploidy-case-calls \ --model cohort23-twelve/cohort23-twelve_1of2-model \ --output case-twelve-vs-cohort23 \ --output-prefix case-twelve-vs-cohort23_1of2 \ --verbosity DEBUG
gatk GermlineCNVCaller \ --run-mode CASE \ -I cvg/NA19017.tsv \ --contig-ploidy-calls ploidy-case-calls \ --model cohort23-twelve/cohort23-twelve_2of2-model \ --output case-twelve-vs-cohort23 \ --output-prefix case-twelve-vs-cohort23_2of2 \ --verbosity DEBUG
This produces both
tracking folders with, e.g. the
case-twelve-vs-cohort23_1of2 basename. The
case-twelve-vs-cohort23_1of2-calls folder contains case sample gCNV call results and the
case-twelve-vs-cohort23_1of2-tracking folder contains model fitting results. The case mode results files are listed in the table below alongside cohort mode data files.
Comments on select parameters
-Ooutput directory must be extant before running the command. Future releases (v22.214.171.124) will create the directory.
- The default
--max-copy-numberis capped at 5. This means the tool reports any events with more copies as CN5.
- For the cohort mode, optionally provide
--annotated-intervalsto include the annotations as covariates. These must contain all of the
-Lintervals is an exact match or a subset of the annotated intervals.
- For the case mode, the tool accepts only a single
--modeldirectory at a time. So the case must be analyzed with the same number of scatters as the cohort model run. The case mode parameters appear fewer than the cohort mode because the
--modeldirectory provides the seemingly missing requirements, i.e. the scatter intervals and the annotated intervals.
- For both modes, provide the
--contig-ploidy-callsresults from DetermineGermlineContigPloidy (Section 3). This not only informs ploidy but also establishes baseline coverage and noise levels for each sample. Later, in section 5, GermlineCNVCaller's shard analyses refer back to these global observations.
--verbosity DEBUGallows tracking the Python gcnvkernel model fitting in the stdout, e.g. with information on denoising epochs and whether the model converged. The default INFO level verbosity is the next most verbose and emits only GATK Engine level messages.
At this point, the workflow has done its most heavy lifting to produce data towards copy number calling. In Section 5, we consolidate the data from the scattered GermlineCNVCaller runs, perform segmentation and call copy number states.
One artificial construct of the tutorial is the use of full three contig ploidy calls data even when modeling copy number states for much smaller subset regions. This effectively stabilizes the small analysis.
☞ 4.1 How do I increase the sensitivity of detection?
The tutorial uses default GermlineCNVCaller modeling parameters. However, researchers should expect to tune parameters for data, e.g. from different sequencing technologies. For tuning, first consider the coherence length parameters, p-alt, p-active and the psi-scale parameters. These hyperparameters are just a few of the plethora of adjustable parameters GermlineCNVCaller offers. Refer to the GermlineCNVCaller tool documentation for detailed explanations, and ask on the GATK Forum for further guidance.
The tutorial illustrates one set of parameter changes for WGS data provided by @markw of the GATK SV (Structural Variants) team that dramatically increase the sensitivity of calling on the tutorial data. Article#11687 and Notebook#11686 compare the results of using default vs. the increased-sensitivity parameters. Given the absence of off-the-shelf filtering solutions for CNV calls, when tuning parameters to increase sensitivity, researchers should expect to perform additional due diligence, especially for analyses requiring high precision calls.
WGS parameters that increase the sensitivity of calling from @markw
--class-coherence-length 1000.0 \ --cnv-coherence-length 1000.0 \ --enable-bias-factors false \ --interval-psi-scale 1.0E-6 \ --log-mean-bias-standard-deviation 0.01 \ --sample-psi-scale 1.0E-6 \
Comments on select sensitivity parameters
--class-coherence-lengthfrom its default of 10,000bp to 1000bp decreases the expected length of contiguous segments. Factor for bin size when tuning.
--cnv-coherence-lengthfrom its default 10,000bp to 1000bp decreases the expected length of CNV events. Factor for bin size when tuning.
- Turning off
--enable-bias-factorsfrom the default
falseturns off active discovery of learnable bias factors. This should always be on for targeted exome data and in general can be turned off for WGS data.
--interval-psi-scalefrom its default of 0.001 to 1.0E-6 reduces the scale the tool considers normal in per-interval noise.
--log-mean-bias-standard-deviationfrom its default of 0.1 to 0.01 reduces what is considered normal noise in bias factors.
--sample-psi-scalefrom its default of 0.0001 to 1.0E-6 reduces the scale that is considered normal in sample-to-sample variance.
Additional parameters to consider include
--depth-correction-tauhas a default of 10000.0 (10K) and defines the precision of read-depth concordance with the global depth value.
--p-activehas a default of 1e-2 (0.01) and defines the prior probability of common CNV states.
p-althas a default of 1e-6 (0.000001) and defines the expected probability of CNV events (in rare CNV states).
☞ 4.2 How do I make interval lists for scattering?
This step applies to the cohort mode. It is unnecessary for case mode analyses as the model implies the scatter intervals.
cnv_germline_cohort_workflow.wdl pipeline workflow scatters the GermlineCNVCaller step. Each scattered analysis is on genomic intervals subset from intervals produced either from PreprocessIntervals (section 1) or from FilterIntervals (section 2). The workflow uses Picard IntervalListTools to break up the intervals list into roughly balanced lists.
gatk IntervalListTools \ --INPUT chr20sub.cohort.gc.filtered.interval_list \ --SUBDIVISION_MODE INTERVAL_COUNT \ --SCATTER_CONTENT 5000 \ --OUTPUT scatter
This produces three intervals lists with ~5K intervals each. For the tutorial's 1Kbp bins, this gives ~5Mbp genomic coverage per scatter. Each list is identically named
scattered.interval_list within its own folder within the
scatter directory. IntervalListTools systematically names the intermediate folders, e.g.
Comments on select parameters
--SUBDIVISION_MODE INTERVAL_COUNTmode scatters intervals into similarly sized lists according to the count of intervals regardless of the base count. The tool intelligently breaks up the
chr20sub.cohort.gc.filtered.interval_list's ~15K intervals into lists of 5031, 5031 and 5033 intervals. This is preferable to having a fourth interval list with just 95 intervals.
- The tool has another useful feature in the context of the gCNV workflow. To subset
-Ibinned intervals, provide the regions of interest with
--SECOND_INPUT) and use the
--ACTION OVERLAPSmode to create a new intervals list of the overlapping bins. Adding
--SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_CONTENT 5000will produce scatter intervals concurrently with the subsetting.
5. Call copy number segments and consolidate sample results with PostprocessGermlineCNVCalls
PostprocessGermlineCNVCalls consolidates the scattered GermlineCNVCaller results, performs segmentation and calls copy number states. The tool generates per-interval and per-segment sample calls in VCF format and runs on a single sample at a time.
Process a single sample from the 24-sample cohort using the sample index. For NA19017, the sample index is 19.
gatk PostprocessGermlineCNVCalls \ --model-shard-path cohort24-twelve/cohort24-twelve_1of2-model \ --model-shard-path cohort24-twelve/cohort24-twelve_2of2-model \ --calls-shard-path cohort24-twelve/cohort24-twelve_1of2-calls \ --calls-shard-path cohort24-twelve/cohort24-twelve_2of2-calls \ --allosomal-contig chrX --allosomal-contig chrY \ --contig-ploidy-calls ploidy-calls \ --sample-index 19 \ --output-genotyped-intervals genotyped-intervals-cohort24-twelve-NA19017.vcf.gz \ --output-genotyped-segments genotyped-segments-cohort24-twelve-NA19017.vcf.gz \ --sequence-dictionary ref/Homo_sapiens_assembly38.dict
NA19017 is the singular sample with index 0.
gatk PostprocessGermlineCNVCalls \ --model-shard-path cohort23-twelve/cohort23-twelve_1of2-model \ --model-shard-path cohort23-twelve/cohort23-twelve_2of2-model \ --calls-shard-path case-twelve-vs-cohort23/case-twelve-vs-cohort23_1of2-calls \ --calls-shard-path case-twelve-vs-cohort23/case-twelve-vs-cohort23_2of2-calls \ --allosomal-contig chrX --allosomal-contig chrY \ --contig-ploidy-calls ploidy-case-calls \ --sample-index 0 \ --output-genotyped-intervals genotyped-intervals-case-twelve-vs-cohort23.vcf.gz \ --output-genotyped-segments genotyped-segments-case-twelve-vs-cohort23.vcf.gz \ --sequence-dictionary ref/Homo_sapiens_assembly38.dict
Each command generates two VCFs with indices. The
genotyped-intervals VCF contains variant records for each analysis bin and therefore data covers only the interval regions. For the tutorial's small data, this gives 1400 records. The
genotyped-segments VCF contains records for each contiguous copy number state segment. For the tutorial's small data, this is 30 and 31 records for cohort and case mode analyses, respectively.
The two modes--cohort and case--give highly concordant but slightly different results for sample NA19017. The factor that explains the difference is the contribution of the sample itself to the model.
Comments on select parameters
- Specify a
--model-shard-pathdirectory for each scatter of the cohort model.
- Specify a
--calls-shard-pathdirectory for each scatter of the cohort or case analysis.
- Specify the
--contig-ploidy-callsdirectory for the cohort or case analysis.
- By default
--autosomal-ref-copy-numberis set to 2.
- Define allosomal contigs with the
- The tool requires specifying the
- Optionally generate segmented VCF results with
--output-genotyped-segments. The tool segments the regions between the starting bin and the ending bin on a contig. The implication of this is that even if there is a gap between two analysis bins on the same contig, if the copy number state is equal for the bins, then the bins and the entire region between can end up a part of the same segment. The extent of this smoothing over gaps depends on the
--sample-indexrefers to the index number given to a sample by GermlineCNVCaller. In a case mode analysis of a single sample, the index will always be zero.
--sequence-dictionaryis optional. Without it, the tool generates unindexed VCF results. Alternatively, to produce the VCF indices, provide the
-Rreference FASTA or use IndexFeatureFile afterward. The v126.96.36.199
cnv_germline_cohort_workflow.wdlpipeline workflow omits index generation.
Here is the result. The header section with lines starting with
## gives information on the analysis and define the annotations. Notice the singular
END annotation in the
INFO column that denotes the end position of the event. This use of the
END notation is reminiscent of GVCF format GVCF blocks.
In the body of the data, as with any VCF, the first two columns give the contig and genomic start position for the variant. The third
ID column concatenates together
REF column is always N and the
ALT column gives the two types of CNV events of interest in symbolic allele notation--
<DEL> for deletion and
<DUP> for duplication or amplification. Again, the
INFO field gives the
END position of the variant. The
FORMAT field lists sample-level annotations
GT of 0 indicates normal ploidy, 1 indicates deletion and 2 denotes duplication. The
CN annotation indicates the copy number state. For the tutorial small data,
CN ranges from 0 to 3.