(How to) Call somatic copy number variants using GATK4 CNV

A more recent CNV tutorial using v4.0.1.1 has been posted in two parts elsewhere at:
- https://software.broadinstitute.org/gatk/documentation/article?id=11682 and
- https://software.broadinstitute.org/gatk/documentation/article?id=11683
The first part mostly recapitulates the workflow on this page, while the second part adds detection of allelic ratios. Although the v4.0.1.1 tutorial is under review as of May 2, 2018, we recommend you update to the official workflow, especially if performing CNV analyses on WGS data. The official workflow includes algorithmic improvements over the GATK4.beta workflow illustrated here.
This demonstrative tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is not suitable for whole genome sequencing (WGS) data nor for germline calling.
The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the basic approach and general concepts presented here will still be germane. Please check the forum for updates.
Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that set the backbone of this tutorial.
► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791.
► For the mathematics behind the workflow, see this whitepaper.
Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Resolving CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.
Jump to a section
- Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
- Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
- Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
- Segment the normalized coverage profile using PerformSegmentation
  ☞ I get an error at this step!
- (Optional) Plot segmented coverage using PlotSegmentedCopyRatio
  ☞ What is the QC value?
- Call segmented copy number variants using CallSegments
- Discussion of interest to some
  ☞ Why can't I use just a matched normal?
  ☞ How do the results compare to SNP6 analyses?
Tools, system requirements and example data download
This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We previously made the program jar specially available alongside the data bundle in the workshops directory here; the original worksheets are in the 1610 folder. However, the data bundle was only available to workshop attendees. Note that other tools in this program release may be unsuitable for analyses.
The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.
- R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.
Rscript install_R_packages.R
We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.
- XQuartz for the optional section 5. Your system may already have this installed.
- The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.
1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.
We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.
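In other words, writing n_t for the sample's read count at target t and N for the sample's total read count, each value the tool reports is simply

\text{pcov}_t = n_t / N

so every target's coverage is expressed as a fraction of the sample's total.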
java -jar gatk4.jar CalculateTargetCoverage \
    -I <input_bam_file> \
    -T <input_target_tsv> \
    -transform PCOV \
    -groupBy SAMPLE \
    -targetInfo FULL \
    -keepdups \
    -O <output_pcov_file>
- The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool; for our example data, padding each exome target 250 bp on either side increases sensitivity. A sketch of a padding command follows this list.
- Setting the -targetInfo option to FULL keeps the original target names from the target list.
- The -keepdups option asks the tool to include alignments flagged as duplicate.
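For reference, a padding command might look like the following sketch. The argument names here (-T, -padding, -O) follow the conventions of the other commands in this tutorial and are assumptions, so check PadTargets' help output for your version.

java -jar gatk4.jar PadTargets \
    -T <input_target_tsv> \
    -padding 250 \
    -O <output_padded_target_tsv>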
The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.



For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?
2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
In this step, we use two commands to create the CNV panel of normals (PoN).
The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples; adjust this number according to sample coverage depth.
What is better, tissue-matched normals or blood normals of tumor samples?
What makes a better background control, a matched normal sample or a panel of normals?
The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals; a sketched example follows.
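For illustration, a normals.txt might look like this, with one proportional coverage file per line (file names hypothetical):

cov/normal_1.pcov.tsv
cov/normal_2.pcov.tsv
cov/normal_3.pcov.tsv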
java -jar gatk4.jar CombineReadCounts \
    -inputList normals.txt \
    -O sandbox/combined-normals.tsv
The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients from the same cohort (not suffering from blood cancers).
java -jar gatk4.jar CreatePanelOfNormals \
    -I sandbox/combined-normals.tsv \
    -O sandbox/normals.pon \
    -noQC \
    --disableSpark \
    --minimumTargetFactorPercentileThreshold 5
This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and change the --minimumTargetFactorPercentileThreshold to 5%.
Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?
3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.
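As a conceptual sketch only (simplified notation of ours, not the tool's exact algorithm; see the whitepaper for the real formulation): if s is the sample's median-normalized log2 coverage vector and the columns of V form an orthonormal basis for the PoN's principal noise components, tangent normalization subtracts the part of s that lies in that noise subspace,

s_{\text{TN}} = s - V V^{\top} s

so systematic noise shared with the panel is removed while sample-specific copy-number signal remains.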
java -jar gatk4.jar NormalizeSomaticReadCounts \
    -I cov/tumor.tsv \
    -PON sandbox/normals.pon \
    -PTN sandbox/tumor.ptn.tsv \
    -TN sandbox/tumor.tn.tsv
This produces the pre-tangent-normalized file (-PTN) and the tangent-normalized file (-TN), respectively. The resulting data is log2-transformed.
Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.
4. Segment the normalized coverage profile using PerformSegmentation
Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.
java -jar gatk4.jar PerformSegmentation \
    -TN sandbox/tumor.tn.tsv \
    -O sandbox/tumor.seg \
    -LOG
For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.
View the resulting file with cat sandbox/tumor.seg.

Which chromosomes have events?
☞ I get an error at this step!
This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.
Rscript install_R_packages.R
We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.
5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio
This is an optional step that plots segmented coverage.
This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.
java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG
The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOG parameter tells the tool that the inputs are log2-transformed. The output folder contains seven files: three PNG images and four text files.

- Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
- Before_After_CR_Lim_4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
- FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.
Open each of these images. How many copy-number variants do you see?
☞ What is the QC value?
Each of the four text files contains a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization.
- preQc.txt gives the QC value before tangent-normalization.
- postQc.txt gives the post-tangent-normalization QC value.
- dQc.txt gives the difference between pre and post QC values.
- scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).
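In symbols, writing c_i for the copy ratio at target i (our notation), the QC value is

\text{QC} = \text{median}_i \, | c_{i+1} - c_i |

Because real copy-number events change the adjacent-target difference only at segment boundaries, the median is robust to them.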
6. Call segmented copy number variants using CallSegments
In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment: neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.
java -jar gatk4.jar CallSegments \
    -TN sandbox/tumor.tn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox/tumor.called
View the results with cat sandbox/tumor.called.

Besides the last column, how is this result different from that of step 4?
7. Discussion of interest to some
☞ Why can't I use just a matched normal?
Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).



What is different between the plots? Look closely.
The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even one visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.
☞ How do the results compare to SNP6 analyses?
As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.
Comments
Hi @shlee,
can I also use the GATK4 pre-release from the Docker container (Version:4.alpha-249-g7df4044-SNAPSHOT)? The current nightly doesn't seem to contain CalculateTargetCoverage. Also, in both releases the arguments for the input intervals and the target feature file are listed under Optional Arguments, but you need at least one of them to make the program run.
In addition, the target files for the two releases seem to differ:
Version:4.alpha-249-g7df4044-SNAPSHOT needs a standard .bed file with fileformat tag and correct filename extension, while the 1.2.3 pre-release needs a tsv file formatted like this:
##fileFormat = tsv
contig  start  stop  name
1       1      101   target-0
Can you suggest which file format will be needed by the final release?
Greetings EADG
Hi @EADG,
You're correct that at least one of the optional arguments is required.
In the final release, we will most likely stick with the TSV file format. You can use the tool ConvertBedToTargetFile if you have a BED file you'd like to use.
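For example, a conversion command might look like the following sketch. The argument names are assumptions on my part, so check ConvertBedToTargetFile's help output for your version.

java -jar gatk4.jar ConvertBedToTargetFile \
    -I <input_bed> \
    -O <output_target_tsv>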
Unfortunately, I am not familiar with the Docker image you ask about---maybe @LeeTL1220 can chime in here.
Hi,
This method looks great and I love the thoroughness of the approach. I have a few questions as we try to implement it:
The method as described starts with a BAM file. Is there an assumption that this BAM file has been processed beyond being aligned? That is, do you remove PCR duplicates from the BAM? Presumably BQSR or the like is not performed here?
Ideally, there would be a WDL file you could share that would show this pipeline with code - is that something you have/could point me to?
We at the Parker Institute are very interested in using this approach, and your help here will be great in getting this going!
Hi @dannykwells,
Thanks for your compliments on the method and interest in using this approach. We are excited about it too. Let us know what else we can help you with.
Yes, the BAMs are preprocessed according to our Best Practices, including duplicate marking and BQSR. I checked with our developers and it is still the case that this workflow uses the -keepdups option to include duplicates. Our developers have found that including the duplicates gives empirically better results.
The WDLs are in https://github.com/broadinstitute/gatk-protected/tree/master/scripts/cnv_wdl/somatic. Two of the scripts are relevant: cnv_somatic_panel_workflow.wdl is for creating the PoN and cnv_somatic_pair_workflow.wdl is for calling on a Tumor-Normal pair.
would this process work for very small panels or only exome-type panels?
Hi @Pepetideo,
This workflow works for exome-type panels. I believe it can also work for small panels so long as you have some requisite minimum number of samples in the PoN. This is because the normalization is per target site.
It would be helpful for us to know exactly what you mean by very small panels.
Hi @shlee Thanks! We are hopefully starting to implement soon.
Hi,
just for your information: for the wdl workflow you need the current gatk4-protected. The 1.2.3 pre-release won't work!
Hi @shlee ,
is there a bug in the wdl file?
Line 67 in cnv_somatic_tasks.wdl has --transform PCOV, but CombineReadCounts needs absolute read counts, so it has to be set to RAW... Can you confirm this?
Greetings EADG
Hi @EADG,
You are asking whether CombineReadCounts requires raw or proportional coverage data. I believe it should take either. Are you getting an error message?
For the workflow presented in this page's tutorial, CombineReadCounts uses proportional coverage (PCOV) data.
Hi there,
I was wondering about the best approach to do CNV calling on WGS data.
In principle, I shouldn't need to define a target region/interval (-T / -L options) as I'm interested in the whole genome; however, to improve performance it might be good to limit CNV calling to exome regions, right?
So my question is whether there is a way to look only for CNVs in exonic regions, or whether there is a standard bed file for hg38 that defines exonic regions that I might use with CalculateTargetCoverage?
Thank you!
Hi @leiendeckerlu,
Good news and bad news. We have a new GATK4 GermlineCNVCaller. However, it is still experimental. We'll have a beta release jar here sometime this week or next, I'm told. Again, the release jar will be in BETA status but the tool itself is in ALPHA status and may undergo further changes.
If you are interested, you can read up on the tool here. I just updated the document portion, which is the lines that start with an asterisk (*).
As for your question on whether to define target intervals, this depends on the quality of the analysis set reference you are using, whether you want to include the sex chromosomes knowing the implications for doing so, etc.
Remember that exonic regions are a small fraction of the whole genome. You can check our Resource Bundle for interval lists for human. I believe there is a WGS interval list for GRCh38. If you insist on exonic regions, then one suggestion, which I have myself used in the past, is to take GENCODE exons and convert them into an intervals list for use with whole exome data.
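As a sketch of that conversion, assuming you have exported GENCODE exons as a BED file (file names here are hypothetical), Picard's BedToIntervalList can produce a Picard-style intervals list:

java -jar picard.jar BedToIntervalList \
    I=gencode_exons.bed \
    SD=reference.dict \
    O=gencode_exons.interval_list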
Hi @shlee,
Thank you for this comprehensive answer! I've a few follow up questions:
That's great news and I will check out the GermlineCNVCaller once it is out!
However, I'm a little bit confused now: I only need the GermlineCNVCaller if I want to call CNVs in the germline, right? If I'm only interested in somatic CNVs in tumors, I create a PoN with all the normal WGS files (> 20 T/N) that I have (using the human interval list or as I describe below) and then follow the pipeline described above, or am I missing something here?
Also, what, by definition, is the PoN when doing germline CNV calling?
I just went ahead and created a bed file covering all canonical transcript regions and used this for my analysis. That should work the same, right? I will rerun the analysis with the human intervals list and see how things change.
Again, thanks a lot!
An additional comment regarding optional step 5 above:
My version requires the additional option of a sequence dictionary (-SD), e.g. hg38.dict. I'm not sure what the case is in the final GATK4, but maybe you could add that above.
Hi @shlee,
yes, I am getting an error message that CombineReadCounts requires an integer value in the tsv file (wdl workflow only). So I changed --transform PCOV to RAW.
Now I am running into another error/exception right at the end: CreatePanelOfNormals throws the following stack trace:
I am using the latest version of gatk4-protected. Any clue?
Hi @shlee again,
I found what was raising the error...
First I disabled Spark, as recommended for the non-WDL workflow here. After that, CreatePanelOfNormals ran for hours on 3 samples and a handful of targets without creating any output.
Then I replaced all 0s in the tsv file with 1.0 and voilà, CreatePanelOfNormals finished in seconds.
Is there an option so that CreatePanelOfNormals automatically ignores lines with 0 coverage for a specific target?
Hi @leiendeckerlu,
I misanswered your initial question! Sorry about that, and thanks for the followup. So yes, as you say, the GermlineCNVCaller is for germline contexts, and for your somatic context you want to use the somatic workflow. For WGS data, to calculate read coverage, please use SparkGenomeReadCounts.
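A minimal sketch of such an invocation follows; the argument names are assumptions on my part, so consult the tool's help output for the exact flags:

java -jar gatk4.jar SparkGenomeReadCounts \
    -I <input_bam> \
    -O <output_coverage_tsv>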
You are asking workflow questions and I think you'll find plumbing the newly public repo helpful. Specifically, there are WDL scripts at https://github.com/broadinstitute/gatk/tree/master/scripts. The somatic CNV scripts are in https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/somatic. One of these outlines PoN creation. We will be working on documenting all of these workflows in the forum going forward.
Remember there are differences between BED files and GATK/Picard-style intervals: BED files use 0-based start coordinates, while GATK/Picard-style intervals are 1-based. You can use the new ConvertBedToTargetFile. I've just finished documenting it, so the link will tell you more. Please be sure to use GATK/Picard-style intervals with our other tools.
P.S. As for your comment:
These types of differences are the reason this tutorial mentions exactly which version of the tool we use. There is new optional reference handling in GATK4; it will be documented in the argument options.
@EADG,
I've asked a developer to look into your questions. We'll get back to you shortly.
Hi @EADG,
You should feel free to use Spark for the non-WDL workflow. When Spark is enabled, the Spark implementation of SVD is used to perform PCA (rather than the Apache Commons implementation).
The test case with a very small number of targets that you are trying to run is rather idiosyncratic. Furthermore, there is some hard filtering that occurs before the PCA that gets rid of low coverage or extreme targets, so your first case is probably getting reduced to a single target. This is probably causing the Apache implementation to fail silently. Can I ask what you are trying to test with this small number of targets?
Also, can you post the stacktrace for the previous error with CombineReadCounts that you were running into? Thanks!
Hi @slee,
thanks for your quick answer. I just performed a dry run of the wdl pipeline; that's why I chose such a small number of targets and samples.
Is there a minimum number of samples/targets you can recommend for a dry run?
Is that the filtering you mentioned? It seems nothing is being filtered in my case...
Stack trace for CombineReadCounts:
Greetings EADG
Hi @EADG,
In practice, I haven't found it necessary to do a dry run. But perhaps ~10 samples and ~100 targets would be enough to avoid any weird Apache-related errors.
I am also not sure why you are getting that CombineReadCounts error. Are you sure that Sample1.aligned.duplicate_marked.sorted.coverage.tsv is a properly formatted TSV? It looks like you may have some commas in that file that are being parsed incorrectly, resulting in the message that you see. Can you confirm if this is the case?
You should be able to pass PCOV coverage files to CombineReadCounts, and then pass the resulting file on to CreatePanelOfNormals, as is done in the WDL.
I tried running CalculateTargetCoverage with the --groupBy SAMPLE option; it finished successfully, but I don't see the 5th column (proportional coverage) in the output tsv file. When I tried --groupBy COHORT, I got the 5th column but all the values are NaN. Here is the command I used:
java -Djava.io.tmpdir=$TMPDIR -Xmx4g -jar gatk4-protected-local.jar CalculateTargetCoverage --reference $REF -I $BAM -T $TARGETS -transform PCOV --targetInformationColumns FULL --groupBy SAMPLE -O ${OUTDIR}/${SM}.tsv
Am I missing something?
Hi @wleeidt,
The columns represent: contig, start, stop, name and then the sample. You are saying the first four columns have values (target information) but the sample column is empty of values for the targets. I have two suggestions.
[1] Include your duplicate reads with:
[2] Use the latest release at https://github.com/broadinstitute/gatk/releases. We've merged the protected repo with the gatk repo.
Let me know if you still get the error after [1], and then after [2].
Thanks @shlee. I downloaded the latest release and tried your suggestion with the --disableReadFilter parameter. It still didn't generate the sample column. Here is the command that I used:
/mnt/mywork/bin/GATK4/gatk-4.beta.1/gatk-launch CalculateTargetCoverage -I B_2.sorted.bam -T input_target.tsv --transform PCOV --groupBy SAMPLE --targetInformationColumns FULL --disableReadFilter NotDuplicateReadFilter -O cov/B_2.targetCov.tsv
Hi @wleeidt,
Can you check two things for me then. First, can you run
samtools view -H B_2.sorted.bam | grep '@RG'
and post what you see. Is there an SM field? Second, if all seems well with your RG fields, then can you run Picard CollectHsMetrics on your BAM to ensure there is coverage for your targets? You will have to convert your target intervals to Picard-format target intervals. You can use the same target intervals for the BAIT_INTERVALS and TARGET_INTERVALS.
Will this work if I have a single tumor-normal pair (not a panel)? Can I just make a PoN using a single normal sample and use it downstream?
Hi @slee,
sorry for my late response; I was tied up with a bunch of urgent analyses. I ran your original wdl for PoN creation with 80 samples and got the same error. I will attach a snippet of the file later today.
Greetings EADG
Hi @anand_mt,
Please construct a CNV PoN from a minimum of 10 samples. The recommended number of samples is 40. CreatePanelOfNormals will error if you provide it just a single normal. It will also error if you provide it the same normal (with different names) multiple times. Also, the point of the workflow is to denoise coverage profiles with the PoN. Remember that copy number profiles are inherently noisy as there is some stochastic element to coverage depths. A good PoN is critical to discerning CNV events from inherently noisy coverages with the sensitivity that somatic analyses require.
P.S. I have just finished writing a new tutorial (for the UK workshop) that shows the effects of calling CNV events from two different CNV PoNs. You might find taking the hour to run through it helpful in understanding the importance of the PoN. In fact, let me just attach the worksheet here. It's the first iteration and will likely undergo further improvements. Please direct any questions you have about the new tutorial to me if anything is unclear.
Thank you for the document and all the nice explanations. I was about to do something similar that is addressed in the document (using TCGA blood normals to create a PoN). Thank you, super helpful.
Thanks @shlee! After adding read group in the bam header, I got the column with the coverage.
@shlee. I ran into a problem running
gatk-launch CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv
where normals.txt contains the list of tsv files (with full paths) generated by CalculateTargetCoverage. I checked to make sure that the files in normals.txt are present. But when I run CombineReadCounts, it complained that it couldn't find the files:
Using GATK jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv
12:19:22.811 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[July 13, 2017 12:19:22 PM PDT] CombineReadCounts --inputList normals.txt --output S1-S2-cov.tsv --maxOpenFiles 100 --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
[July 13, 2017 12:19:22 PM PDT] Executing as [email protected] on Linux 3.10.0-514.21.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Version: 4.beta.1
12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 1
12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:19:22.842 INFO CombineReadCounts - Deflater: IntelDeflater
12:19:22.842 INFO CombineReadCounts - Inflater: IntelInflater
12:19:22.842 INFO CombineReadCounts - Initializing engine
12:19:22.842 INFO CombineReadCounts - Done initializing engine
12:19:22.852 INFO CombineReadCounts - Shutting down engine
[July 13, 2017 12:19:22 PM PDT] org.broadinstitute.hellbender.tools.exome.CombineReadCounts done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1561853952
A USER ERROR has occurred: Couldn't read file /usr/local/wlee/AHM/cov/S1.targetCov.tsv. Error was: /usr/local/wlee/AHM/cov/S1.targetCov.tsv (No such file or directory)
@wleeidt,
Can you echo or cat your file to ensure it exists? Make sure the tsv file doesn't have a .txt extension.
Yes, they are in .tsv format. Here is a little snippet of the file:
##fileFormat = tsv
##commandLine = CalculateTargetCoverage --output
/mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/cov/20170707-HMLHJAFXX-S1.targetCov.tsv --groupBy sample --transform PCOV --targets /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/bed/Exomev1_CNVv2_Targets.merged.tsv --targetInformationColumns FULL --input /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/bam/20170707-HMLHJAFXX-S1.sorted.RG.bam --cohortName --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
##title = Read counts per target and sample
contig start stop name 20170707-HMLHJAFXX-S1
chr1 14870 14990 core_1 1.735e-05
chr1 69090 70008 469_358463_79501(OR4F5)_1_2 2.392e-05
chr1 367658 368597 469_360600_729759(OR4F29)_1_3 4.171e-05
chr1 565346 565466 core_2 2.082e-06
chr1 621095 622034 469_358708_81399(OR4F16)_1_4 4.382e-05
chr1 755880 756000 core_3 6.501e-06
The next thing to confirm is that none of your files have duplicate sample names.
For testing, I only have 2 samples, and they have unique sample names. I have tried running CombineReadCounts on another two samples, and it works. The process of generating the tsv files was consistent, using CalculateTargetCoverage.
@wleeidt, that's odd. If you provide each coverage with the -I argument, does CombineReadCounts process them? If not, would you mind regenerating your coverages for the two samples fresh and then running CombineReadCounts on them?
@shlee. Thanks. When I use the -I argument for each of the coverage files in CombineReadCounts, it processes them just fine. But when I tried the -inputList normals.txt argument, where normals.txt contains the full paths to the same two coverage files, it fails.
Alright, then something's up with your file paths. I think the tool is able to handle both absolute and relative file paths. I don't think ownership of the target directory should make a difference, e.g. read-write permissions, as you should have at least read permission for all of your directories. Is there anything else you can say is different between the file paths that worked and did not work in your test of the inputList?
Hi @slee,
as promised, here is an example of one of the tsv files I feed to CombineReadCounts:
Still got the same error...
Hm, when I review the course material from last week, I see a point (.) instead of a comma (,). Maybe something is wrong with my language options in Linux/R or elsewhere?
Thanks in advance
EADG
@wleeidt, I hope you were able to resolve your issue. If so, please do share so the wider community can benefit. If not, let's file a bug report. Instructions are in Article#1894.
@EADG,
Can you replace your commas with points and see if you still get the error? Thanks.
P.S. Our developer says that you should expect the error you see due to the commas.
@shlee. I ran PlotSegmentedCopyRatio and it returned SUCCESS, but no plots are generated except for a file called plotting_dump.rda. Here is the command:
gatk-4.beta.1/gatk-launch PlotSegmentedCopyRatio -TN S4_tumor.pn.tsv -PTN S4_tumor.ptn.tsv -S S4_tumor.seg -O sandbox -SD hg19.dict -pre S4_gatk4_cnv_segment -LOG
I tried running it in Xterm on a mac; still no plot is generated.
Hi @shlee, @slee,
yes, it works when I replace the commas with points, but since I am using the wdl pipeline provided by slee on GitHub, this is not a sufficient workaround. Since I suspected that the problem (commas instead of points) arose from my language settings, I tried different things to change the language on my system. To make a long story short, the solution was to start CalculateTargetCoverage with the flags -Duser.language=en -Duser.country=US (I added them in cnv_somatic_tasks.wdl).
And now the cnv_somatic_panel_workflow.wdl runs smoothly to the end, with points instead of commas.
Another solution would be to run the pipeline in a Docker container that has an English Java version installed, like the one provided by the GATK team.
I hope this also helps other users who are working with non-English Java versions.
Greetings EADG
@wleeidt, Can you double check that the names of the contigs in the sequence dictionary match in representation those in your data?
@shlee. I checked the bam header and the sequence dictionary, and the contigs are identical.
Thanks @EADG. I'll let our developer know that the tools themselves generate the data with the commas that then error downstream tools. And thanks for the workaround of adding -Duser.language=en -Duser.country=US to commands. For gatk-launch, these options would then be passed as --javaOptions "-Duser.language=en -Duser.country=US".
Ok, @wleeidt. Let's fill out a bug report so we can recapitulate your error on our side and get to the source of the error. Instructions for filing a bug report are in Article#1894.
@shlee. Thanks for sending the instructions for submitting a bug report. I have put all the required files, the descriptions of how I ran the command, and the stdout of the tool in a text file in bugReport_by_wleeidt.updated.zip and uploaded it to the ftp server.
Hi @wleeidt,
I have received your data and in my hands I can generate plots. Your data shows a negative dQc value that indicates PoN normalization was ineffective. I will have a developer look into better error messaging and they may follow up here with questions.
Three dependencies pop into my head and these are:
Can you check to see if you have all of the above? If not, can you install them and run your command again? You can find links/instructions for these dependencies in the attached PDF. This is the first iteration (to be improved) of a somatic CNV tutorial that I recently finished writing. It compares the effects of using two different PoNs for denoising and how the choice can impact somatic CNV detection.
@shlee. I had installed XQuartz, and R 3.3.3 was installed. I just installed the R dependencies from https://github.com/broadinstitute/gatk/blob/master/scripts/docker/gatkbase/install_R_packages.R. Now PlotSegmentedCopyRatio is able to generate the plots. Thanks so much for your help!
That's great @wleeidt. Glad you can move ahead.
@shlee. Is there a tool for plotting the B-allele frequency along with the copy number ratio?
@wleeidt, PlotACNVResults can plot both segmented copy-ratio and minor-allele-fraction estimates.
@shlee. I have a question about the PoN. You mentioned previously that it's recommended to have at least 10 normal samples to generate the PoN. What if I only have 2 normal samples? Is that going to affect the accuracy of the whole CNV analysis? I plotted the minor allele fraction using PlotACNVResults, using GetBayesianHetCoverage to get the het SNPs with one normal bam and one tumor bam. In the end, the minor allele fractions are all hovering around 0.3, which looks strange.
@wleeidt,
In general, I think it best if you follow the recommendations of minimum 10 to 40 samples for the PoN. For your other questions, because I am unfamiliar with the ACNV workflow and GetBayesianHetCoverage, I am passing your question on to our developers. You can click on the Issue Number above to view any ensuing discussion.
@wleeidt,
If you only have 2 normal samples in your PoN, that will reduce the amount of denoising that you can do on your copy-ratio data. Noisy copy-ratio data may yield poor copy-ratio segments, and since the latter is used as input to ACNV, may ultimately affect minor-allele-fraction estimates.
However, from your plots, it looks like your copy-ratio data is not too noisy, so I don't think this is to blame for your hets/minor-allele-fraction result. It looks like the hets are indeed unusual---there are many hets with zero minor-allele-fraction and the rest are rather widely distributed in [0, 0.5].
I'm not sure if I can diagnose the issue without knowing more about your data. What was the depth of your normal and tumor? Were they sequenced using the identical capture kits? Could you post your GetBayesianHetCoverage and AllelicCNV command line invocations?
@shlee. I ran into an error in AllelicCNV that I didn't see before, and it only happened to one of my samples.
@slee. The samples are using the same capture kit. Here are the commands:
gatk-launch --javaOptions "-Xmx8g" GetBayesianHetCoverage --reference hg19.fa --snpIntervals hg19.snpv147.interval_list --tumor S4.sorted.bam --normal S1.sorted.bam --tumorHets tumor.S4.hets.tsv --normalHets normal.S1.hets.tsv
and
gatk-launch --javaOptions "-Xmx8g" AllelicCNV -pre S4_gatk4_cnv -S S4_tumor.seg -TN S4_tumor.pn.tsv -THET tumor.S4.hets.tsv
The mean target coverage for the normal samples is around 80, while the S4 tumor sample is a bit lower, at 48.
@wleeidt It looks like your command lines are in order and the coverage sounds reasonable. Perhaps you can try running the normal through AllelicCNV to see if you get reasonable minor-allele fractions around 0.5, as a sanity check? Could it be reasonable that your tumor is high purity and exhibits a genome-wide CNV duplicating one of the alleles (since the MAFs are suspiciously close to 1/3), or could it be contaminated by another normal? It's hard for me to say anything conclusive without examining your data in detail, but the tool appears to be fitting the data you're giving it as expected.
I ran the normal sample and the minor-allele fractions are mostly 0.5 (except for chromosomes 1, 5, and 8). Please see attached plot. The tumor sample that I have is just a reference sample with known CNVs (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753).
@wleeidt The normal sample looks very reasonable (at least in terms of hets and minor-allele fractions; it's possible that any activity in the copy ratio is being denoised away, since your normal is included in your PoN---which we don't recommend!)
Just to double check, the normal and tumor samples S1 and S4 are matched?
Also, regarding your AllelicCNV error, this is a bug that we've encountered previously and should be fixed (see https://github.com/broadinstitute/gatk-protected/issues/804). It's possible that you've encountered another edge case and we may have to revise the fix. What version of the jar are you running?
@slee. We only have two normal samples, which I used to make the PoN, and I used one of the two normal samples for the CNV analysis with a tumor sample. What kind of normal samples would you recommend if I don't use the "normal" that I have in the CNV analysis? Also, for this dataset, the samples are not exactly a tumor/normal matched pair. The normal sample is the Horizon AKT1 Wild Type Reference Standard (https://www.horizondiscovery.com/akt1-wild-type-reference-standard-hd659) while the "tumor" sample is the Horizon Structural Multiplex Reference Standard (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753). Essentially, I am trying to have known CNV references to see if we can detect them. I also have tumor/normal matched dataset analysis results, which are very noisy (see attached). Do you know of any GIAB-type dataset that we can use for benchmarking?
@wleeidt I see---for the allelic workflow, the samples must be matched. The purpose of the GetBayesianHetCoverage tool is to identify het sites in the normal and to collect the corresponding allelic counts at these sites in the tumor, which allows the minor-allele fraction to be estimated from the change in the alt-allele fraction.
As for the issue of which normal samples to use to build a PoN, let me clarify: the purpose of the PoN is to use normal samples to learn patterns of systematic sequencing noise, which is essentially done using PCA. However, it's possible that any CNVs in the normal samples, if they contribute enough to the covariance of the data, may get picked up as sequencing noise. When you have only two normals in the PoN, any activity in them may indeed contribute significantly to the covariance!
In our sanity check, we used only two normals to build the PoN and then tried to use the PoN to denoise the copy-ratio data from one of those normals. So it's likely that CNV activity got "denoised" away in the copy-ratio result for this normal, but that it would still show up in the allele-fraction data---this is a possible explanation for the behavior we see in chromosomes 1, 5, and 8.
If we had really intended to perform a proper analysis on that normal sample (rather than just a sanity check of the allele-fraction result), a better approach would have been to build a PoN of many other normal samples (as @shlee said earlier, we recommend at least 10s of samples), not including the normal of interest, and then using these to denoise said normal.
As for benchmarking, we often use TCGA samples with matched WES/WGS to check concordance. There are also some cell lines, such as HCC1143 and COLO829, which have CNV call sets. However, I think a somatic benchmark on the level of the GIAB datasets is still not available.
@wleeidt Also, if you could let me know the version you ran, I can look into that AllelicCNV bug you encountered.
@slee. Thanks for the explanation. I was using gatk-4.beta.1.
@wleeidt OK, thanks. Looks like we'll have to amend the bug fix. Thanks for catching it!
@wleeidt Just to let you know, I amended the bug fix in https://github.com/broadinstitute/gatk/pull/3378. Please let me know if that fixes your error on that sample, when you get a chance.
@EADG
Hi EADG,
Can you tell us which version of the tool you are using?
One of the developers says this should not be an issue with the latest release. They recently made a change that should prevent this from happening (set the locale to English/US before running the tool).
-Sheila
Running the AllelicCNV tool in gatk-4.beta.3-SNAPSHOT, I'm getting a dynamic library linking error:
snipped out normal output
Any ideas?
Hi @Sheila,
I'll try to get the correct version...
But I think it is the one before the latest. I will try the latest version and check if the error persists.
@ameynert
Hi,
I have asked someone else on the team to respond. They will get back to you soon.
-Sheila
Hi @ameynert,
It appears that you are using the jar directly by invoking it with java -jar. Can you try your command via the gatk-launch script, e.g. gatk-launch AllelicCNV ..., and let me know if you get the same error? Also, can you tell me the reference you are using, e.g. GRCh38? Thanks.
Hi @Sheila, @shlee
the new version (4.beta.3) fixed the issue, thanks for your help.
By the way, I see the same behavior/issue with Picard MergeVcfs in the latest Picard version (2.10.9-SNAPSHOT)...
Greetings EADG
Hi @EADG,
Which issue do you speak of?
The point and comma issue with CalculateTargetCoverage, which was coming from a different language setting...
I also observed the issue with MergeVcfs in Picard when I combine multiple VCFs on a machine with German language settings.
Hi @EADG,
Does your previously stated solution work for your MergeVcfs issue?
Hi @shlee,
yes, it does, but maybe the developers can fix this in coming versions of Picard...
Soon you'll be able to invoke Picard from GATK4 so this will be solved that way.
Yeah looking forward to it
Hi @shlee. I'm running via gatk-launch and the assembly is hg38.
Hi @ameynert,
My best guess is your environment's Java is choking on GRCh38 contigs, e.g. those that contain colons (:) and/or asterisks (*). Here is a command that I know folks in the research community use that works:
This was shared with me last week and it is the first time I've seen folks specify the java jni path. I'm not sure if this would make a difference for your case, but perhaps you could try it.
Otherwise, can you do me a favor and try your command with the official GATK Docker? The Docker image is broadinstitute/gatk:4.beta.3. You can invoke the launch script with /gatk/gatk-launch. Currently, our official WDL scripts use the jar. Here's an example: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/somatic/cnv_somatic_allele_fraction_pair_workflow.wdl. But we recommend invoking the program with the gatk-launch script.
If you need simple instructions on how to start using Docker locally, then let me know.
P.S. With gatk-launch, I believe you specify java options as --javaOptions "-Djava.library.path=/usr/lib/jni/".
P.P.S. I am trying to find my reply to another forum member about why you should exclude such alt and decoy contigs from CNV analyses. Alas, I cannot find it. Please let me know if you need me to expand on why.
Hi,
I have been using the GATK4 alpha for copy number analysis. I created a PoN and did segmentation. Now I want to check for allelic CN, which requires heterozygous sites. For now I am using germline heterozygous variants. Is that okay? Are there any suggestions on selecting variants?
Sorry for cross-posting. (https://gatkforums.broadinstitute.org/gatk/discussion/10164/gatk4-allelic-cnv#latest)
Hi @anand_mt,
I believe you provide GetHetCoverage a list of common SNP sites with the --snpIntervals argument, and the tool should derive sites of heterozygous variants for you from the normal BAM. These sites are written to the --normalHets file that you name.
Hi @ameynert,
I've written up simple instructions on getting started with Docker at https://gatkforums.broadinstitute.org/gatk/discussion/10172/. The tutorial is titled (How to) Run the GATK4 Docker locally and take a look inside. The document lives under the gatk-4-beta category at https://gatkforums.broadinstitute.org/gatk/categories/gatk-4-beta on account of GATK4 being in beta status.
Thank you for the reply. Just to make sure, 'common SNP sites' would be germline variants (present in both tumor and normal)?
Hi @anand_mt,
Common SNP sites are where a species shows variation commonly across individuals, so the variation refers to germline variants. You could use a known variant sites resource, e.g. the 1000 Genomes Project callset. Alternatively, you could use your Normal's genotype callset, if you have it. The tool uses the callset as a list of genomic positions to check within the BAMs and will do its own determination of whether the Normal is heterozygous. Having this list allows the tool to be efficient, as it doesn't have to comb through every genomic position to assess whether the Normal is het at the position.
Hi @shlee, I've installed Docker and got GATK 4 beta 3 running in an instance. AllelicCNV is still failing for me, though at a different point. I checked; R is installed and looks fine.
Hi @ameynert,
Our developer says that from the standard out:
So we think something is up with your data. Do you have any insight about your data that could explain this? Would it be possible for you to share your data so we can investigate the cause of the error further? Instructions for submitting data to our FTP site are in Article#1894.
Hi @shlee - It's ovarian cancer data. Manta analysis shows very high levels of genomic instability. I'll talk to the PI about sharing. I also should note that I only have 10 normal samples at the moment from which to construct a PON, and the somatic CNV analysis done with GATK 4 beta 3 is not very clean as a result. I am hoping that more normal samples will help with this problem (expecting 200 eventually). I did create a PON of 100 female samples from a population study with samples sequenced on the same machines, but the resulting CNV calls were considerably worse.
Hi @ameynert,
Given these are human subject data, if you do get approval from your PI to share, I think the course of action is to share with us on FISMA-compliant FireCloud. Our developers are keen on understanding what is going on with your data.
Are you using normals for the PoN that use the same sample prep, e.g. capture kit? It is insufficient to have normals sequenced just on the same machines. I believe a lot of variance comes from sample preparation, and this is the key covariate to control for in lieu of a large PoN. This is something I have asked our developers about--the difference between a carefully selected and limited PoN set and a larger PoN set that adds more dimensions to the variance. The answer is that our algorithms are designed to account for added dimensions in the PCA projection. The thing to note is that there is a different linear range towards improvement depending on the capture kit. E.g., I have in my notes that for ICE capture, improvement plateaus at 150 samples in the PoN, versus 300 samples for Agilent capture (empirical observations in our hands). It seems your 100-sample PoN falls short of this range, and your choice is to (i) add more samples to the PoN (which you are planning on) or (ii) carefully select samples for a limited PoN.
Like any experimental control, selecting samples for the PoN may be the most challenging part of the workflow. However, it is worth spending time on for the improved results it can give.
Users have reported a bug in CombineReadCounts that occurs when attempting to combine the same sample sets from different batches that share sample names.
The solution to this is to rename the samples; @mcadosch gives a link in https://gatkforums.broadinstitute.org/gatk/discussion/comment/41704#Comment_41704 that shows one approach. Many thanks for their solution.
Hi @shlee, thanks for the detailed answer. The samples are WGS, but I agree the sample source seems to have a strong effect. I'm in the process of confirming with the PI about sharing the read counts and common het SNP info with you.
Hi @shlee ,
I have been using the GATK4 beta to call CNVs. At step 6, I am wondering what the threshold is for CallSegments to make CNV calls? It seems that there is no documentation available for CallSegments.
Many thanks in advance.
@liupeng2117
Hi,
Soo Hee is away at a conference. I will have her or someone else on the team get back to you. It may take a few days.
-Sheila
Hi @Sheila,
Thank you! Another question: I am using whole exome sequencing (WES) data to do CNV calling. For the target intervals list, shall we use each entire exon as an interval, or split the exon into, say, <=100 bp regions as target intervals? Which is better and more powerful?
Hi @liupeng2117,
Thanks for your questions. The caller in CallSegments uses a relatively naive method to call events that is exactly ported from a previous algorithm developed by Broad CGA, ReCapSeg. If you're interested in the details, we unfortunately don't have convenient documentation, but you may want to look at the relevant source code: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/exome/ReCapSegCaller.java You can see that the threshold for calling is based on taking the segments with mean log2 copy ratio below 0.1 as neutral and determining the standard deviation of the points in these segments.
For the upcoming release, we are making significant method-related and performance-related improvements to the CNV pipeline (including using a new segmentation algorithm, rather than CBS---if you're interested in many more details, you can see https://github.com/broadinstitute/gatk/issues/2858). These improvements will also include a more principled event caller. Stay tuned!
As for your second question, this new pipeline will include a simple tool called PreprocessIntervals that will do exactly what you ask---you'll be able to increase the resolution of your WES analyses by binning exons into smaller regions. We will run some internal evaluations to find an optimal binning size (since we will want to strike a balance between resolution and increasing the variance of the read count in each bin) and give some default recommendations, but this is ultimately a parameter that should be decided based on the particular goals of each analysis.
Right now, a version of the tool is already in the latest version of the GATK, but we have not yet integrated it with the rest of the new CNV pipeline. Notably, the new tool produces a Picard interval list. However, if you'd like to go ahead and try to use the tool with the current CNV pipeline, you can manually convert back to the target-file format expected by that pipeline.
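For readers who want to experiment now, a binning invocation might look like the following sketch; the argument names are assumptions based on later GATK4 releases, so check PreprocessIntervals' help output for your version:

gatk-launch PreprocessIntervals \
    -R <reference_fasta> \
    -L <targets_interval_list> \
    --binLength 100 \
    -O <binned_interval_list>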
Thanks,
Samuel
Hi @slee, thanks for your detailed answer. Great to know a new version of the tool is releasing. Can't wait to try the new pipeline!
Hi
I tried to call somatic CNVs for mouse tumor cell line data with matched normal strain samples. The genetic backgrounds of these normal strain samples are actually quite different. Can I combine them to build a PoN even though they are sequenced with the same technology and in the same batch?
I think the genetic differences of the normal samples might be captured as sequencing noise and cause bias in the somatic CNV calling of the corresponding tumor cell lines. Is that right? Is it better to use tumor vs. matched normal pairs to call CNVs in this kind of case?
Does GATK also provide CNV calling functions for tumor-normal pairs? If not, do you have any suggestions on how I can handle it? Many thanks!
@LanC
Hi,
I will confirm with the team and get back to you.
-Sheila
Hi,
I am using the beta5 local jar from the download button on the download page, but it seems CallCNLoHAndSplits is not in the release. Has it been replaced, or will it be released in a new beta version?
thanks.