It looks like you're new here. If you want to get involved, click one of these buttons!
This demonstrative tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is not suitable for whole genome sequencing (WGS) data nor for germline calling.
The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the presented basic approach and general concepts will still be germaine. Please check the forum for updates.
Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that set the backbone of this tutorial.
Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Being able to resolve CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.
This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-22.214.171.124-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We have made the program jar specially available alongside the data bundle here. Note other tools in this program release may be unsuitable for analyses.
The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.
R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.
We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.
XQuartz for optional section 5. Your system may already have this installed.
In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.
We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.
java -jar gatk4.jar CalculateTargetCoverage \ -I <input_bam_file> \ -T <input_target_tsv> \ -transform PCOV \ -groupBy SAMPLE \ -targetInfo FULL \ –keepdups \ -O <output_pcov_file>
-Tis a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool. For our example data, padding each exome target 250bp on either side increases sensitivity.
-targetInfooption to FULL keeps the original target names from the target list.
–keepdupsoption asks the tool to include alignments flagged as duplicate.
The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.
For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?
In this step, we use two commands to create the CNV panel of normals (PoN).
The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as that of the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples. Depending on the coverage depth of samples, adjust the number.
What is better, tissue-matched normals or blood normals of tumor samples?
What makes a better background control, a matched normal sample or a panel of normals?
The first step combines the proportional read counts from the multiple normal samples into a single file. The
-inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.
java -jar gatk4.jar CombineReadCounts \ -inputList normals.txt \ -O sandbox/combined-normals.tsv
The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients from the same cohort (not suffering from blood cancers).
java -jar gatk4.jar CreatePanelOfNormals \ -I sandbox/combined-normals.tsv \ -O sandbox/normals.pon \ -noQC \ --disableSpark \ --minimumTargetFactorPercentileThreshold 5
This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the
-noQC option and change the
--minimumTargetFactorPercentileThreshold to 5%.
Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?
In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.
java -jar gatk4.jar NormalizeSomaticReadCounts \ -I cov/tumor.tsv \ -PON sandbox/normals.pon \ -PTN sandbox/tumor.ptn.tsv \ -TN sandbox/tumor.tn.tsv
This produces the pre-tangent-normalized file
-PTN and the tangent-normalized file
-TN, respectively. Resulting data is log2-transformed.
Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.
Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.
java -jar gatk4.jar PerformSegmentation \ -TN sandbox/tumor.tn.tsv \ -O sandbox/tumor.seg \ -LOG
For our tumor sample, we reduce the ~73K individual targets to 14 segments. The
-LOG parameter tells the tool that the input coverages are log2-transformed.
Which chromosomes have events?
This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.
We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.
This is an optional step that plots segmented coverage.
This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.
java -jar gatk4.jar PlotSegmentedCopyRatio \ -TN sandbox/tumor.tn.tsv \ -PTN sandbox/tumor.ptn.tsv \ -S sandbox/tumor.seg \ -O sandbox \ -pre tumor \ -LOG
-O defines the output directory, and the
-pre defines the basename of the files. Again, the
-LOG parameter tells the tool that the inputs are log2- transformed. The output folder contains seven files--three PNG images and four text files.
Open each of these images. How many copy-number variants do you see?
Each of the four text files contain a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization.
In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.
java -jar gatk4.jar CallSegments \ -TN sandbox/tumor.tn.tsv \ -S sandbox/tumor.seg \ -O sandbox/tumor.called
View the results with cat sandbox/tumor.called.
Besides the last column, how is this result different from that of step 4?
Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).
What is different between the plots? Look closely.
The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even if visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.
As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.