
(How to part I) Sensitively detect copy ratio alterations and allelic segments

shlee, Cambridge Member, Broadie ✭✭✭✭✭
edited May 2018 in Tutorials

Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.

This workflow is broken into two tutorials. You are currently on the first part.

The tutorial outlines steps in detecting copy ratio alterations, more familiarly copy number variants (CNVs), as well as allelic segments in a single sample using GATK4. The tutorial (i) denoises case sample alignment data against a panel of normals (PoN) to obtain copy ratios (Tutorial#11682) and (ii) models segments from the copy ratios and allelic counts (Tutorial#11683). The latter modeling incorporates data from a matched control. The same workflow steps apply to targeted exome and whole genome sequencing data.

Tutorial#11682 covers sections 1–4. Section 1 prepares a genomic intervals list with PreprocessIntervals and collects read coverage counts across the intervals. Section 2 creates a CNV PoN with CreateReadCountPanelOfNormals using read coverage counts. Section 3 denoises read coverage data against the PoN with DenoiseReadCounts using principal component analysis. Section 4 plots the results of standardizing and denoising copy ratios against the PoN.

Tutorial#11683 covers sections 5–8. Section 5 collects counts of reference versus alternate alleles with CollectAllelicCounts. Section 6 incorporates copy ratio and allelic counts data to group contiguous copy ratio and allelic counts segments with ModelSegments using kernel segmentation and Markov-chain Monte Carlo. The tool can also segment either copy ratio data or allelic counts data alone. Both types of data together refine segmentation results in that segments are based on the same copy ratio and the same minor allele fraction. Section 7 calls amplification, deletion and neutral events for the segmented copy ratios. Finally, Section 8 plots the results of segmentation and estimated allele-specific copy ratios.

Plotting is across genomic loci on the x-axis and copy or allelic ratios on the y-axis. The first part of the workflow focuses on removing systematic noise from coverage counts and adjusts the data points vertically. The second part focuses on segmentation and groups the data points horizontally. The extent of grouping, or smoothing, is adjustable with ModelSegments parameters. These adjustments do not change the copy ratios; the denoising in the first part of the workflow remains invariant in the second part of the workflow. See Figure 3 of this poster for a summary of tutorial results.

► The official GATK4 workflow is capable of running efficiently on WGS data and provides much greater resolution, up to ~50-fold more for tested data. In these ways, GATK4 CNV improves upon its predecessor workflows in GATK4.alpha and GATK4.beta. Validations are still in progress, so the workflow itself is in BETA status, even though most tools are production ready. The exception is ModelSegments, which is still in BETA status and may change in small but significant ways going forward. Use at your own risk.

► The tutorial skips explicit GC-correction, an option in CNV analysis. For instructions on how to correct for GC bias, see AnnotateIntervals and DenoiseReadCounts tool documentation.

The GATK4 CNV workflow offers a multitude of levers, e.g. towards fine-tuning analyses and towards controls. Researchers are expected to tune workflow parameters on samples with similar copy number profiles as their case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.

Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters

Tools involved

  • GATK4.0.1.1, the version this tutorial uses, or later releases.
  • The plotting tools require particular R components. Options are to install these or to use the broadinstitute/gatk Docker. In particular, to match versions, use the broadinstitute/gatk Docker image tagged with the corresponding GATK version.

Download example data

Download tutorial_11682.tar.gz and tutorial_11683.tar.gz, either from the GoogleDrive or from the FTP site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data, see Tutorial#11136's third footnote and [1].

Alternatively, download the spacecade7/tutorial_11682_11683 docker image from DockerHub. The image contains GATK4.0.1.1 and the data necessary to run the tutorial commands, including the GRCh38 reference. Allocation of at least 4GB memory to Docker is recommended before launching the container.

1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts

Before collecting the coverage counts that form the basis of copy number variant detection, we define the resolution of the analysis with a genomic intervals list. The extent of genomic coverage and the size of the genomic intervals in the list both factor into resolution.

Preparing a genomic intervals list is necessary whether an analysis is on targeted exome data or whole genome data. In the case of exome data, we pad the target regions of the capture kit. In the case of whole genome data, we divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.

For the tutorial exome data, we provide the capture kit target regions in 1-based intervals and set --bin-length to zero.

gatk PreprocessIntervals \
    -L targets_C.interval_list \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.preprocessed.interval_list

This produces a Picard-style intervals list targets_C.preprocessed.interval_list for use in the coverage collection step. Each interval is expanded by 250 bases on either side.

Comments on select parameters

  • The -L argument is optional. If provided, the tool expects the intervals list to be in Picard-style as described in Article#1319. The tool errs for other formats. If this argument is omitted, then the tool assumes each contig is a single interval. See [2] for additional discussion.
  • Set the --bin-length argument to be appropriate for the type of data, e.g. default 1000 for whole genome or 0 for exomes. In binning, an interval is divided into equal-sized regions of the specified length. The tool does not bin regions that contain Ns. [3]
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The --reference or -R is required and implies the presence of a corresponding reference index and a reference dictionary in the same directory.
  • To change the padding interval, specify the new value with --padding. The default value of 250 bases was determined to work well empirically for TCGA targeted exome data. This argument is relevant for exome data, as binning without an intervals list does not allow for intervals expansion. [5]
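To build intuition for the padding step, here is a minimal Python sketch. The midpoint-truncation behavior for nearby intervals is an illustrative assumption (it relates to the question below), not a statement of the tool's exact algorithm; the function and data are hypothetical.

```python
# Hypothetical sketch of interval padding. Each 1-based (start, end) interval
# is expanded by `padding` bases on either side. Where two padded neighbors
# would overlap, this sketch truncates both at the midpoint between them --
# an assumption for illustration, not PreprocessIntervals' documented behavior.
def pad_intervals(intervals, padding=250):
    padded = [[max(1, s - padding), e + padding] for s, e in intervals]
    for a, b in zip(padded, padded[1:]):
        if a[1] >= b[0]:  # padded neighbors overlap
            mid = (a[1] + b[0]) // 2
            a[1], b[0] = mid, mid + 1
    return [tuple(iv) for iv in padded]

# Two targets 100 bases apart: the padding meets at the midpoint between them.
print(pad_intervals([(1000, 1500), (1600, 2000)]))
# A lone target simply gains 250 bases on each side.
print(pad_intervals([(1000, 2000)]))
```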

Take a look at the intervals before and after padding.


For consecutive intervals less than 250 bases apart, how does the tool pad the intervals?

Now collect raw integer counts data. The tutorial uses GATK4.0.1.1's CollectFragmentCounts, which counts coverage of paired-end fragments. A fragment is counted once, toward the interval overlapping the fragment's center. In GATK4.0.3.0, CollectReadCounts replaces CollectFragmentCounts. CollectReadCounts counts reads that overlap an interval.

The tutorial has already collected coverage on the tumor case sample, on the normal matched-control and on each of the normal samples that constitute the PoN. To demonstrate coverage collection, the following command uses the small BAM from Tutorial#11136’s data bundle [6]. The tutorial does not use the resulting file in subsequent steps. The CollectReadCounts command swaps out the tool name but otherwise uses identical parameters.

gatk CollectFragmentCounts \
    -I tumor.bam \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/tumor.counts.hdf5

In the tutorial data bundle, the equivalent full-length result is hcc1143_T_clean.counts.hdf5. The data tabulates CONTIG, START, END and raw COUNT values for each genomic interval.

Comments on select parameters

  • The -L argument interval list is a Picard-style interval list prepared with PreprocessIntervals.
  • The -I input is alignment data.
  • By default, data is in HDF5 format. To generate text-based TSV (tab-separated values) format data, specify --format TSV. The HDF5 format allows for quicker panel of normals creation.
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter, FirstOfPairReadFilter, ProperlyPairedReadFilter and MappingQualityReadFilter. [7]

☞ 1.1 How do I view HDF5 format data?

See Article#11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in this tutorial.


2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals

In creating a PoN, CreateReadCountPanelOfNormals abstracts the counts data for the samples and the intervals using Singular Value Decomposition (SVD), a type of Principal Component Analysis (PCA). The normal samples in the PoN should match the sequencing approach of the case sample under scrutiny. This applies especially to targeted exome data because the capture step introduces target-specific noise.

The tutorial has already created a CNV panel of normals using forty 1000 Genomes Project samples. The command below illustrates PoN creation using just three samples.

gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.counts.hdf5 \
    -I HG00733.alt_bwamem_GRCh38DH.20150826.PUR.exome.counts.hdf5 \
    -I NA19654.alt_bwamem_GRCh38DH.20150826.MXL.exome.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O sandbox/cnvponC.pon.hdf5

This generates a PoN in HDF5 format. The PoN stores information that, when applied, will (i) standardize case sample counts to PoN median counts and (ii) remove systematic noise in the case sample.
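As a conceptual sketch of what the PoN captures, the following numpy toy example computes per-interval medians and the top singular vectors of a small panel. The matrix values, the log2 standardization step, and the choice of two components are illustrative assumptions; the actual HDF5 layout and the tool's filtering are more involved.

```python
import numpy as np

# Toy panel: rows = normal samples, columns = intervals (proportional
# coverage). Values are fabricated for illustration.
panel = np.array([
    [0.30, 0.20, 0.50],
    [0.28, 0.22, 0.50],
    [0.32, 0.18, 0.50],
])

# (i) Per-interval medians across the panel, later used to standardize a case.
medians = np.median(panel, axis=0)

# (ii) Top right-singular vectors of the median-standardized, log2-transformed
# panel; these capture systematic noise shared across the panel samples.
standardized = np.log2(panel / medians)
_, _, vt = np.linalg.svd(standardized, full_matrices=False)
eigensamples = vt[:2]  # keep the top two components for this toy example

print(medians)             # per-interval panel medians
print(eigensamples.shape)  # (2, 3)
```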

Comments on select parameters

  • Provide integer read coverage counts for each sample using -I. Coverage data may be in either TSV or HDF5 format. The tool will accept a single sample, e.g. the matched-normal.
  • The default --number-of-eigensamples or principal components is twenty. The tool will adjust this number to the smaller of twenty or the number of samples the tool retains after filtering. In general, denoising against a PoN with more components improves segmentation, but at the expense of sensitivity. Ideally, researchers should perform a sensitivity analysis to choose an appropriate value for this parameter. See this related discussion.
  • To run the tool using Spark, specify the Spark Master with --spark-master. See Article#11245 for details.

Comments on filtering and imputation parameters, in the order of application

  1. The tutorial changes the --minimum-interval-median-percentile argument from the default of 10.0 to a smaller value of 5.0. The tool filters out targets or bins with a median proportional coverage below this percentile. The median is across the samples. The proportional coverage is the target coverage divided by the sum of the coverage of all targets for a sample. The effect of setting this parameter to a smaller value is that we retain more information.
  2. The --maximum-zeros-in-sample-percentage default is 5.0. Any sample with more than 5% zero coverage targets is filtered.
  3. The --maximum-zeros-in-interval-percentage default is 5.0. Any target interval with more than 5% zero coverage across samples is filtered.
  4. The --extreme-sample-median-percentile default is 2.5. Any sample with less than 2.5 percentile or more than 97.5 percentile normalized median proportional coverage is filtered.
  5. The --do-impute-zeros default is set to true. The tool takes zero coverage regions and changes these values to the median of the non-zero values. The tool additionally normalizes values below the 0.10 percentile or above the 99.90 percentile to the respective percentile values.
  6. The --extreme-outlier-truncation-percentile default is 0.1. The tool takes any proportional coverage below the 0.1 percentile or above the 99.9 percentile and sets it to the corresponding percentile value.
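To make proportional coverage and the median-percentile filter concrete, here is a toy numpy sketch of step 1 above. The counts matrix is fabricated for illustration, and the implementation details are assumptions; only the general idea mirrors the tool.

```python
import numpy as np

# Fabricated counts matrix: rows = panel samples, columns = intervals.
counts = np.array([
    [100.,  5., 200., 300.],
    [120.,  4., 180., 310.],
    [ 90.,  6., 210., 290.],
])

# Proportional coverage: each interval's count over its sample's total.
prop = counts / counts.sum(axis=1, keepdims=True)

# Median proportional coverage per interval, taken across samples.
interval_medians = np.median(prop, axis=0)

# Drop intervals whose median falls below the 5th percentile, mirroring
# --minimum-interval-median-percentile 5.0.
cutoff = np.percentile(interval_medians, 5.0)
kept = interval_medians >= cutoff

print(kept)  # the low-coverage second interval is filtered out
```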

The current filtering and imputation parameters are identical to those in the BETA release of the CNV workflow and may change in later versions based on evaluations. The implementation has been made more memory efficient, so the tool runs faster than the BETA release.

If the data are not uniform, e.g. have many intervals with zero or low counts, the tool issues a warning to adjust filtering parameters and stops the run. This may happen, for example, if one attempts to construct a panel of mixed-sex samples and includes the allosomal contigs [8]. In this case, first be sure to either exclude allosomal contigs via a subset intervals list or subset the panel samples to those expected to have similar coverage across the given contigs, e.g. panels of the same sex. If the warning still occurs, then adjust --minimum-interval-median-percentile to a larger value. See this thread for the original discussion.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers? Could PCA account for GC-bias?
What do you know about the 1000 Genomes Project? Specifically, the exome data?
How could we tell a good PoN from a bad PoN? What control could we use?

In a somatic analysis, what is better for a PoN: tissue-matched normals or blood normals?
Should we include our particular tumor’s matched normal in the PoN?


3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts

Provide DenoiseReadCounts with counts collected by CollectFragmentCounts and the CNV PoN generated with CreateReadCountPanelOfNormals.

gatk --java-options "-Xmx12g" DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals cnvponC.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.denoisedCR.tsv

This produces two files, the standardized copy ratios hcc1143_T_clean.standardizedCR.tsv and the denoised copy ratios hcc1143_T_clean.denoisedCR.tsv, each of which represents a data transformation. In the first transformation, the tool standardizes counts by the PoN median counts. The standardization includes log2 transformation and normalizing the counts data to center around one. In the second transformation, the tool denoises the standardized copy ratios using the principal components of the PoN.
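The two transformations can be sketched roughly in numpy as follows. The helper names, the toy counts, and details such as median-centering are assumptions for illustration; refer to the DenoiseReadCounts documentation for the precise algorithm.

```python
import numpy as np

def standardize(case_counts, pon_medians):
    """Proportional coverage, divide by PoN medians, log2-transform, then
    median-center. A sketch; the tool's exact normalization may differ."""
    prop = case_counts / case_counts.sum()
    log_ratios = np.log2(prop / pon_medians)
    return log_ratios - np.median(log_ratios)

def denoise(standardized, eigensamples):
    """Subtract the component of the standardized profile lying in the span
    of the PoN's top singular vectors (rows of `eigensamples`)."""
    projection = eigensamples.T @ (eigensamples @ standardized)
    return standardized - projection

pon_medians = np.array([0.30, 0.20, 0.50])  # toy PoN medians
case = np.array([300.0, 440.0, 500.0])      # toy raw case counts
std = standardize(case, pon_medians)
print(std)

# With a single toy eigensample along the first axis, denoising removes any
# systematic signal on that axis and leaves the other intervals untouched.
print(denoise(np.array([0.5, 0.5, 2.0]), np.array([[1.0, 0.0, 0.0]])))
```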

Comments on select parameters

  • Because the default --number-of-eigensamples is null, the tool uses the maximum number of eigensamples available in the PoN. In section 2, by using default CreateReadCountPanelOfNormals parameters, we capped the number of eigensamples in the PoN at twenty. Changing the --number-of-eigensamples in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. See this thread for detailed discussion.
  • Additionally provide the optional --annotated-intervals generated by AnnotateIntervals to concurrently perform GC-bias correction.


4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios

We plot the standardized and denoised read counts with PlotDenoisedCopyRatios. The plots allow visual assessment of the efficacy of denoising. Provide the tool with both the standardized and denoised copy ratios from the previous step, as well as a reference sequence dictionary.

gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces six files in the plots directory--two PNG images and four text files, as follows.

  • hcc1143_T_clean.denoised.png plots the standardized and denoised read counts across the contigs and scales the y-axis to accommodate all copy ratio data.
  • hcc1143_T_clean.denoisedLimit4.png plots the same but limits the y-axis range from 0 to 4 for comparability across samples.

Each of the text files contains a single quality control value. The value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number events and should decrease after denoising.

  • hcc1143_T_clean.standardizedMAD.txt gives the MAD for standardized copy ratios.
  • hcc1143_T_clean.denoisedMAD.txt gives the MAD for denoised copy ratios.
  • hcc1143_T_clean.deltaMAD.txt gives the difference between standardized MAD and denoised MAD.
  • hcc1143_T_clean.scaledDeltaMAD.txt gives the fractional difference (standardized MAD - denoised MAD)/(standardized MAD).
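Under the straightforward reading of this metric, the calculation can be sketched in a few lines. The two copy ratio profiles below are fabricated for illustration.

```python
from statistics import median

def mad_of_adjacent(copy_ratios):
    """Median of absolute differences between copy ratios of adjacent
    targets. Robust to true copy-number events: a long event contributes
    only its two breakpoint differences, which the median ignores."""
    diffs = [abs(b - a) for a, b in zip(copy_ratios, copy_ratios[1:])]
    return median(diffs)

# Fabricated profiles: the same segments before and after denoising.
noisy    = [1.0, 1.4, 0.9, 1.3, 2.1, 1.8, 2.2, 1.9]
denoised = [1.0, 1.1, 1.0, 1.1, 2.0, 2.0, 2.1, 2.0]
print(mad_of_adjacent(noisy), mad_of_adjacent(denoised))  # MAD decreases
```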

Comments on select parameters

  • The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping.
  • To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

Here are the results for the HCC1143 tumor cell line and its matched normal cell line. The normal cell line serves as a control. For each sample are two plots that show the effects of PCA denoising. The upper plot shows standardized copy ratios in blue and the lower plot shows denoised copy ratios in green.

4A. Tumor standardized and denoised copy ratio plots

4B. Normal standardized and denoised copy ratio plots

Would you guess there are CNV events in the normal? Should we be surprised?

The next step is to perform segmentation. This can be done either using copy ratios alone or in combination with allelic copy ratios. In part II, Section 6 outlines considerations in modeling segments with allelic copy ratios, section 7 generates a callset and section 8 shows how to plot segmented copy and allelic ratios. Again, the tutorial presents these steps using the full features of the workflow. However, researchers may desire to perform copy ratio segmentation independently of allelic counts data, e.g. for a case without a matched-control. For the case-only analysis, segmentation gives the following plots. To recapitulate this approach, omit the allelic-counts parameters from the example commands in sections 6 and 8.

4C. Tumor case-only copy ratios segmentation gives 235 segments.

4D. Normal case-only copy ratios segmentation gives 41 segments.

While the normal sample shows trisomy of chr2 and a subpopulation with deletion of chr6, the tumor sample is highly aberrant. The extent of aneuploidy is unsurprising and consistent with these HCC1143 tumor dSKY results by Wenhan Chen. Remember that cell lines, with increasing culture time and selective bottlenecks, can give rise to new somatic events, undergo clonal selection and develop population heterogeneity much like in cancer.

☞ 4.1 Compare two PoNs: considerations in the panel of normals creation

Denoising with a PoN is critical for calling copy-number variants from targeted exome coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.

To understand the impact a PoN's constituents can have on an analysis, compare the results of denoising the normal sample against two different PoNs. Each PoN consists of forty 1000 Genomes Project exome samples. PoN-M consists of the same cohort used in the Mutect2 tutorial's PoN. We selected PoN-C's constituents with more care and this is the PoN the CNV tutorial uses.

4E. Compare standardization and denoising with PoN-C versus PoN-M.

What is the difference in the targets for the two cohorts--cohort-M and cohort-C? Is this a sufficient reason for the difference in noise profiles we observe above?

GATK4 denoises exome coverage profiles robustly with either panel of normals. However, a good panel allows maximal denoising, as is the case for PoN-C over PoN-M.

We use publicly available 1000 Genomes Project data so as to be able to share the data and to illustrate considerations in CNV analyses. In an actual somatic analysis, we would construct the PoNs using the blood normals of the tumor cohort(s). We would construct a PoN for each sex, so as to be able to call events on allosomal chromosomes. Such a PoN should give better results than either of the tutorial PoNs.

Somatic analyses, due to the confounding factors of tumor purity and heterogeneity, require high sensitivity in calling. However, a sensitive caller can only do so much. Use of a carefully constructed PoN augments the sensitivity and helps illuminate copy number events.

This section is adapted from a hands-on tutorial developed and written by Soo Hee Lee (@shlee) in July of 2017 for the GATK workshops in Cambridge and Edinburgh, UK. The original tutorial uses the GATK4.beta workflow and can be found in the 1707 through 1711 GATK workshops folders. Although the Somatic CNV workflow has changed from GATK4.beta and the official GATK4 release, the PCA denoising remains the same. The hands-on tutorial focuses on differences in PCA denoising based on two different panels of normals (PoNs). Researchers may find working through the worksheet to the very end with either release version beneficial, as considerations in selecting PoN constituents remain identical.

Examining the read group information for the samples in the two PoNs shows a difference in mixtures of sequencing centers--four different sequencing centers for PoN-M versus a single sequencing center for PoN-C. The single sequencing center corresponds to that of the HCC1143 samples. Furthermore, tracing sample information will show different targeted exome capture kits for the sequencing centers. Comparing the denoising results of the two PoNs stresses the importance of selective PoN creation.

☞ 4.2 Compare PoN denoising versus matched-normal denoising

A feature of the GATK4 CNV workflow is the ability to normalize a case against a single control sample, e.g. a tumor case against its matched normal. This involves running the control sample through CreateReadCountPanelOfNormals, then denoising the case against this single-sample projection with DenoiseReadCounts. To illustrate this approach, here is the result of denoising the HCC1143 tumor sample against its matched normal. For single-sample matched-control denoising, DenoiseReadCounts produces identical data for standardizedCR.tsv and denoisedCR.tsv.

4F. Tumor case standardized against the normal matched-control

Compare these results to those of section 4.1. Notice the depression in chr2 copy ratios that occurs due to the PoN normal sample's chr2 trisomy. Here, the median absolute deviation (MAD) of 0.149 is an incremental improvement over section 4.1's PoN-M denoising (MAD=0.15). In contrast, PoN-C denoising (MAD=0.125) and even PoN-C standardization alone (MAD=0.134) are seemingly better normalization approaches than the matched-normal standardization. Again, results stress the importance of selective PoN creation.

The PoN accounts for germline CNVs common to its constituents such that the workflow discounts the same variation in the case. It is possible for the workflow to detect germline CNVs not represented in the PoN, in particular, rare germline CNVs. In the case of matched-normal standardization, the workflow should discount germline CNVs and reveal only somatic events.

The workflow does not support iteratively denoising two samples each against a PoN and then against each other.

The tutorial continues in a second document at #11683.



[1] The constituents of the forty sample CNV panel of normals differ from those of the Mutect2 panel of normals. Preliminary CNV data were generated with v4.0.1.1 somatic CNV WDL scripts run locally on a Gcloud Compute Engine VM with Cromwell v30.2. Additional refinements were performed on a 16GB MacBook Pro laptop. Additional plots were generated using a broadinstitute/gatk Docker container. Note the v4.0.1.1 WDL script does not allow custom sequence dictionaries for the plotting steps.

[2] Considerations in genomic intervals are as follows.

  • For targeted exomes, the intervals should represent the bait capture or target capture regions.
  • For whole genomes, either supply regions where coverage is expected across samples, e.g. that exclude alternate haplotypes and decoy regions in GRCh38 or omit the option for references where coverage is expected for the entirety of the reference.
  • For either type of data, expect to modify the intervals depending on (i) extent of masking in the reference used in read mapping and (ii) expectations in coverage on allosomal contigs. For example, for mammalian data, expect to remove Y chromosome intervals for female samples.

[3] See original discussion on bin size here. The bin size determines the resolution of CNV breakpoints. The theoretical limit depends on coverage depth and the insert-size distribution. Typically bin sizes on the order of the read length will give reasonable results. The GATK developers have tested WGS runs where the bin size is as small as 250 bases.
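The binning idea can be sketched with a hypothetical helper; the tool's actual handling of reference Ns and contig ends may differ, and the toy contig below is fabricated.

```python
def bin_contig(sequence, bin_length=1000):
    """Divide a contig into equal-sized 1-based bins, skipping any bin that
    contains an N. A hypothetical helper sketching PreprocessIntervals
    binning; the tool's actual N and contig-end handling may differ."""
    bins = []
    for start in range(0, len(sequence) - bin_length + 1, bin_length):
        if 'N' not in sequence[start:start + bin_length]:
            bins.append((start + 1, start + bin_length))
    return bins

# Toy 5000-base contig with a 1000-base run of Ns in the middle.
toy_contig = "ACGT" * 500 + "N" * 1000 + "GGCC" * 500
print(bin_contig(toy_contig))  # the all-N bin is skipped
```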

[4] Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. The default is set to ALL for GATK4.0.1.1. For future versions, the default will be set to OVERLAPPING_ONLY.

[5] The tool allows specifying both the padding and the binning arguments simultaneously. If exome targets are very long, it may be preferable to both pad and break up the intervals with binning. This may provide some additional resolution.

[6] The data bundle from Tutorial#11136 contains tumor.bam and normal.bam. These tumor and normal samples are identical to those in the current tutorial and represent a subset of the full data for the following regions:

chr6    29941013    29946495    +    
chr11   915890  1133890 +    
chr17   1   83257441    +    
chr11_KI270927v1_alt    1   218612  +    
HLA-A*24:03:01  1   3502    +

[7] The following regarding read filters may be of interest and applies to the workflow illustrated in the tutorial that uses CollectFragmentCounts.

  • In contrast to prior versions of the workflow, the GATK4 CNV workflow excludes duplicate fragments from consideration with the NotDuplicateReadFilter. To instead include duplicate fragments, specify -DF NotDuplicateReadFilter.
  • The tool only considers paired-end reads (0x1 SAM flag) and the first of pair (0x40 flag) with the FirstOfPairReadFilter. The tool uses the first-of-pair read’s mapping information for the fragment center.
  • The tool only considers properly paired reads (0x2 SAM flag) using the ProperlyPairedReadFilter. Depending on whether and how data was preprocessed with MergeBamAlignment, proper pair assignments can differ from that given by the aligner. This filter also removes single ended reads.
  • The MappingQualityReadFilter sets a threshold for alignment MAPQ. The tool sets --minimum-mapping-quality to 30. Thus, the tool uses reads with MAPQ 30 or higher.

[8] The current tool version requires strategizing denoising of allosomal chromosomes, e.g. X and Y in humans, against the panel of normals. This is because coverage will vary for these regions depending on the sex of the sample. To determine the sex of samples, analyze them with DetermineGermlineContigPloidy. Aneuploidy in allosomal chromosomes, much like trisomy, can still make for viable organisms and so phenotypic sex designations are insufficient. GermlineCNVCaller can account for differential sex in data.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.




  • FPBarthel, Houston Member ✭✭

    Are there any recommendations in terms of CNV PON construction, and how are these different/similar to Mutect2 PON construction? What is the minimum PON size? What if there is a massive gender imbalance, eg. 3 females vs 17 males, is it better to create seperate PON for males and females or to drop sex chromosomes from interval_list and create a single PON? Any considerations for potential technical artifacts, say if you have multiple batches of samples sequenced by different vendors?

  • shlee, Cambridge Member, Broadie ✭✭✭✭✭

    Hi @FPBarthel,

    The recommendation for CNV PoN construction is to select your samples carefully. It would not be unreasonable for the preprocessing to make a good PoN for your sample of interest to take the majority of analysis time.

    The factors to consider are different from Mutect2 PoN construction, where the PoN sites are used to blacklist sites from consideration. With CNV, the PoN is the baseline for what will be considered normal coverage.

    Technically, the PoN creation tool will accept a single sample. But I think you are asking what is the minimum recommended size for a PoN. This should depend on factors such as (i) average depth of coverage and (ii) the extent of random vs. systematic noise etc. In general, targeted exomes come with more noise than WGS, so this should be taken into consideration.

    As for your question:

    What if there is a massive gender imbalance, eg. 3 females vs 17 males, is it better to create seperate PON for males and females or to drop sex chromosomes from interval_list and create a single PON?

    In this case, I have heard mentioned by a developer that one practice is to analyze separately, i.e. create separate PoNs for the allosomal contigs. This way, you can analyze all of your autosomal contigs across the two sexes together.

    Any considerations for potential technical artifacts, say if you have multiple batches of samples sequenced by different vendors?

    Yes, you should consider the variance between the batches, especially if your samples are targeted exomes.

  • sutturka, Member

    Thank you for providing this comprehensive tutorial. We are working with canine tumor WGS data and have limited availability of normal data.

    For a few tumors we were able to collect matched-normal samples, and we plan to use the single matched normal as the PoN. For tumor-only samples we plan to use public canine data to prepare the PoN. The public canine samples have average WGS coverage in the range of 10–20x, while our tumor samples are sequenced at 80x.

    What is your opinion on using 10–20x coverage data for the PoN against 80x tumors? Since data availability is limited, do you have any suggestions for parameter optimization for low-coverage WGS PoN samples?

    In PoN preparation, is the parameter --minimum-interval-median-percentile 5.0 specific to exome data? Should we keep this option at its default for WGS data?

    Also, do you have suggestions or links to previous posts that specifically discuss interpreting the CNV plots in terms of overlapping genes? Do you have any suggestions for converting the CNV output files to GISTIC format?

  • slee, Member, Broadie, Dev ✭✭✭

    @sutturka If all of your normal and tumor samples were sequenced together or with the same protocol, I would rather use all of your normal samples to build a single PoN to denoise all of your tumor samples. The idea of the PoN in the CNV workflow is to capture systematic biases in coverage, which may differ from protocol to protocol more strongly than they vary across samples sequenced with the same protocol. If all of your samples have similar coverage, hopefully this answers your second question about using low-coverage normals to denoise high-coverage tumors as well.

    As for the --minimum-interval-median-percentile parameter, this is a filter that may need to be varied depending on the coverage statistics of the particular bins you choose to collect coverage on. Unfortunately, this means there is no one-size-fits-all recommendation. However, if your choice is not stringent enough, it is likely that the CreateReadCountPanelOfNormals tool will complain that the PCA does not converge. For hg19 WGS with bin sizes of 250bp, I believe that cutoffs in the 5-10% range suffice.

  • shlee (Cambridge), Member, Broadie ✭✭✭✭✭
    edited July 2018

    Hi @sutturka,

    I will see if the developers have any insight to share. In the meantime, let's consider some factors.

    In CNV analysis, the quality of the PoN and how well it matches your case samples can greatly impact results. Given that we can expect WGS coverage to be more even, I think your approach of using public WGS data could still give good results. The one thing I would be careful of is matching library complexity. For one, consider whether the public samples are PCR+ (likely if the data is from years ago) vs. PCR-free (mostly the case more recently) and whether this matches your data. PCR-free is the better scenario. ADDENDUM: Also keep in mind the type of sequencing platform used, e.g. patterned flow cells vs. traditional flow cells.

    The bins that you use in your analysis and for coverage collection from PoN samples must be identical. However, 10–20x is shallow and differs from your 80x case samples by 4–8-fold. The implication is that each bin has fewer counts, so any statistical noise gets amplified and the tools may then find it harder to detect the systematic biases.
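    To make the "fewer counts, amplified noise" point concrete, here is a back-of-envelope sketch (my own approximation, assuming per-bin read counts are roughly Poisson; this is not an official GATK formula):

```python
import math

def expected_reads_per_bin(mean_depth, bin_length_bp, read_length_bp):
    # Rough expected number of reads falling in a bin at a given mean depth.
    return mean_depth * bin_length_bp / read_length_bp

def poisson_cv(expected_reads):
    # Relative noise of a Poisson count: stddev / mean = 1 / sqrt(N).
    return 1.0 / math.sqrt(expected_reads)

# 80x case sample vs. a 10x PoN sample, with 1000 bp bins and 150 bp reads.
case_reads = expected_reads_per_bin(80, 1000, 150)  # ~533 reads per bin
pon_reads = expected_reads_per_bin(10, 1000, 150)   # ~67 reads per bin

print(round(poisson_cv(case_reads), 3))  # 0.043 -> ~4% noise per bin
print(round(poisson_cv(pon_reads), 3))   # 0.122 -> ~12% noise per bin
```

    So the shallow PoN samples carry roughly three times the per-bin statistical noise of the 80x case samples, which is the mismatch described above.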

    Here are some options to consider.

    A. One option might be to use your few matched normals in generating a PoN. WGS should allow for smaller-cohort PoNs than exomes. This would avoid, or at least minimize, batch effects, assuming your matched normals were batch-processed with the tumor case samples during some or all parts of sample prep and sequencing.

    B. Coverage depth becomes less of an issue as depths increase. In this case, you could (i) increase the bin size or (ii) increase the apparent depth per sample. For the latter, and here I am just brainstorming (do wait to see what the developers have to say), pool the shallow-depth samples to make pseudo-samples with greater coverage depth (and complexity). Towards this, I imagine a clustering step to see which samples are most like each other, and to pool according to likeness so as to preserve biologically relevant variation.

    Note that CreateReadCountPanelOfNormals has a number of filtering steps to squeeze out the usable data across samples and across bins. I outline these in the tutorial and they will be noted in the tool stdout. It might be worth considering the impact of each of these filtering steps on your normals cohort.

    You ask:

    In PON preparation, the parameter --minimum-interval-median-percentile 5.0 is specific to exome data? Should we keep this option as default for WGS data?

    The default of this parameter is actually 10.0.

    With this parameter, the tool filters out targets or bins with a median fractional coverage below this percentile. The median is across the samples. The fractional coverage is the target coverage divided by the sum of the coverage of all targets for a sample.

    In the tutorial, we change this for our carefully curated PoN. For preliminary analyses, the default is a good start, and this should still apply to WGS data. If you set this to too small a value (to retain more data in the PoN), one downstream effect is very large copy-ratio values (due to a division step that uses medians).
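    To illustrate, here is a simplified toy sketch of the fractional-coverage filter (my own simplification with made-up numbers; the tool's actual implementation may differ in detail):

```python
import statistics

# Toy coverage matrix: rows = samples, columns = intervals/bins.
coverage = [
    [100, 5, 200, 95],   # sample 1
    [110, 4, 190, 105],  # sample 2
    [ 90, 6, 210, 100],  # sample 3
]

def fractional_coverage(sample_counts):
    # Each interval's coverage divided by the sample's total coverage.
    total = sum(sample_counts)
    return [c / total for c in sample_counts]

frac = [fractional_coverage(s) for s in coverage]

# Median fractional coverage per interval, taken across samples.
n_intervals = len(coverage[0])
medians = [statistics.median(f[i] for f in frac) for i in range(n_intervals)]

def keep_intervals(medians, percentile):
    # Drop intervals whose median falls below the given percentile of all
    # medians (a simplified stand-in for --minimum-interval-median-percentile).
    cutoff = sorted(medians)[int(len(medians) * percentile / 100)]
    return [i for i, m in enumerate(medians) if m >= cutoff]

print(keep_intervals(medians, 25))  # [0, 2, 3] -- the low-coverage bin 1 is dropped
```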

    Also, do you have suggestions or links for the previous posts which specifically talks about interpretation of the CNV plots in terms of overlapping genes? Do you have any suggestions for conversion of CNV output files to GISTIC format?

    Can you explain further what you mean by overlapping genes? You can search the forum and also browse comments by @slee: https://gatkforums.broadinstitute.org/gatk/profile/comments/22599/slee.

    I do not have any suggestions for file compatibility except to point you to the GISTIC documentation at https://software.broadinstitute.org/cancer/cga/gistic. Comparing snippets of the formats, at a glance it seems it would be easy enough to stitch together a GISTIC SEG file from GATK4 CNV data. The one thing to look into is whether the copy ratio is standardized in the same manner, e.g. log2.


    GISTIC SEG:

    Sample  Chromosome  Start Position  End Position    Num markers Seg CN
    secondary_GBM_6 1   328296  79786669    2632    -0.5836395
    secondary_GBM_6 1   79787684    79787955    2   -3.057187
    secondary_GBM_6 1   79791835    141706009   2075    -0.582535
    secondary_GBM_6 1   142661411   245729550   4494    0.1422075
    secondary_GBM_6 2   53452   44596407    1950    0.126455
    secondary_GBM_6 2   44600493    44601507    2   -1.072765
    secondary_GBM_6 2   44620446    174439819   5505    0.141225
    secondary_GBM_6 2   174469892   188713884   770 -0.557449
    secondary_GBM_6 2   188715075   243000317   2125    0.144221
    secondary_GBM_6 3   48603   198541751   7813    0.135339

    CNV .cr.seg:

    chr1    68839   6465331 780     0.154072
    chr1    6465332 6496821 22      -0.828984
    chr1    6519193 13749739        839     -0.157372
    chr1    13770076        13770576        1       -6.510900
    chr1    13772826        19771734        699     -0.148615
    chr1    19780328        23059258        407     0.192938
    chr1    23059259        81991397        4907    -0.133812
    chr1    83869710        84235557        33      0.673129
    chr1    84298306        94121297        728     -0.087338

    CNV .modelFinal.seg:

    chr1    68839   6465331 780     0       0.160315        0.169760        0.177279        NaN     NaN     NaN
    chr1    6465332 6496821 22      0       -0.890681       -0.824216       -0.760293       NaN     NaN     NaN
    chr1    6519193 13749739        839     0       -0.168551       -0.159642       -0.145383       NaN     NaN     NaN
    chr1    13770076        13770576        1       0       -6.765158       -6.471332       -6.112925       NaN     NaN     NaN
    chr1    13772826        19771734        699     0       -0.157224       -0.147047       -0.138442       NaN     NaN     NaN
    chr1    19780328        23059258        407     0       0.197733        0.213542        0.224346        NaN     NaN     NaN
    chr1    23059259        81991397        4907    0       -0.141635       -0.130280       -0.119057       NaN     NaN     NaN
    chr1    83869710        84235557        33      0       0.628804        0.660170        0.714873        NaN     NaN     NaN
    chr1    84298306        94121297        728     0       -0.096587       -0.084086       -0.070801       NaN     NaN     NaN
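    Comparing the GISTIC SEG and .cr.seg snippets above, a minimal conversion sketch could look like the following (the sample name and column mapping are my assumptions from the snippets, not a validated converter; do check that the log2 standardization matches what GISTIC expects):

```python
def cr_seg_to_gistic(sample_name, cr_seg_rows):
    # Map GATK .cr.seg data rows (contig, start, end, number of points,
    # mean log2 copy ratio) onto GISTIC SEG columns:
    # Sample, Chromosome, Start, End, Num markers, Seg.CN.
    out = []
    for row in cr_seg_rows:
        contig, start, end, n_points, mean_log2_cr = row.split()
        out.append("\t".join([sample_name, contig, start, end,
                              n_points, mean_log2_cr]))
    return out

rows = [
    "chr1 68839 6465331 780 0.154072",
    "chr1 6465332 6496821 22 -0.828984",
]
for line in cr_seg_to_gistic("my_tumor", rows):
    print(line)
```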
  • Samir, Member ✭✭

    Thanks for this excellent documentation. About GISTIC2: it took us a while to get it running for the dog genome. Here are some tips. GISTIC2 takes the segment copy-number value as log2(raw value) - 1, so as to convert the resulting CN value to 0 for a diploid genome and make the values more readable.

  • sutturka, Member

    Thank you @shlee and @slee for the excellent answers. This is very helpful. I was looking for GISTIC-like output, which provides the list of amplified/deleted genes along with CNVs and can be plotted with maftools.

    Thank you @Samir for the GISTIC tips. This is extremely helpful.

  • Samir, Member ✭✭

    This is not related to the primary discussion, but I thought it important for the GISTIC2 input format. In my earlier post, I mentioned converting as log2(copy_number_raw_value) - 1. However, that should not be applied to GATK or most other copy-number callers' output, as those are typically log2(tumor/normal) ratios and not raw tumor copy numbers. The log2 ratio is what GISTIC2 requires in the input segmentation file, and for a diploid normal (copy number N = 2, so log2(N) = 1) it is equivalent to log2(raw_value) - 1: log2(T/N) = log2(T) - log2(N) = log2(T) - 1. So, log2 ratio values do not require subtraction by 1. Read more at https://www.biostars.org/p/174382/#175590
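    Spelled out numerically (T here is an illustrative absolute tumor copy number; the identity holds for any diploid normal):

```python
import math

T = 3.0  # e.g. an absolute tumor copy number of 3
N = 2.0  # diploid normal, so log2(N) = 1

lhs = math.log2(T / N)         # the log2 ratio callers report
rhs = math.log2(T) - 1         # log2(raw value) - 1
print(abs(lhs - rhs) < 1e-12)  # True: the two are the same quantity
```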

  • yim6810 (Hong Kong), Member

    Sorry, I am trying to use this workflow to call somatic CNVs for my custom panel of only 70 genes. Is it actually possible to do this, or do I have to give up?

    To limit the analysis to the captured regions and aim for higher resolution there, I tried the -L argument and even produced a custom hg19 reference containing only my target regions (but that failed at PreprocessIntervals). Are these approaches wrong?

    My plots are quite messy.

    Thanks a lot for any advice.

  • Sheila (Broad Institute), Member, Broadie ✭✭✭✭✭


    What do you mean by "it failed at PreprocessIntervals"?


  • FPBarthel (Houston), Member ✭✭
    edited August 2018

    @shlee said:
    The recommendation for CNV PoN construction is to select your samples carefully. It would not be unreasonable for the preprocessing needed to build a good PoN for your sample of interest to take the majority of your analysis time.

    Any considerations for potential technical artifacts, say if you have multiple batches of samples sequenced by different vendors?

    Yes, you should consider the variance between the batches, especially if your samples are targeted exomes.

    Our samples are whole genomes. Is there harm in combining samples from clearly different batches when making a PoN? For example, take the following batches:

    Batch     N. of normals   Read Length   Instrument   Coverage
    Batch A   5               75            HiSeq        40x
    Batch B   10              100           HiSeq        5x
    Batch C   20              100           HiSeq        40x
    Batch D   40              150           NovaSeq      40x

    Batches A and B are too small to make their own PoNs. Is there any harm in combining normals from all batches and creating a single PoN?

    A second, related question, going back to your example of a 40-sample good PoN vs. a 40-sample subpar PoN: if one were to take the 80 samples from both PoNs and combine them into one PoN, would the PoN improve, stay the same, or become worse?

  • Sheila (Broad Institute), Member, Broadie ✭✭✭✭✭


    I think this article will help.


  • DS89 (UK), Member

    My sample was sequenced from multiple libraries, so I followed this link from your documentation https://software.broadinstitute.org/gatk/documentation/article.php?id=3060 and merged the BAM files. Using the merged BAM downstream for BQSR and then for CollectReadCounts, I get the following error:

    java.lang.IllegalArgumentException: The input header contains more than one unique sample name: 537899_I1_L1, 537899_I1_L2, 537899_I1_L3, 537899_I1_L4, 537899_I1_L5, 537899_I1_L6, 537899_I1_L7, 537899_I1_L8

    The command used is here:
    gatk CollectReadCounts -I 537899_I1-picard.GATK.Realign.bam -L targets_C.preprocessed.interval_list --interval-merging-rule OVERLAPPING_ONLY -O 7102092829_merged.counts.hdf5

    I ran the same command on a BAM file that was processed on a single lane only (no merging involved) and it all ran fine.

    Please provide an answer to this issue.


  • FPBarthel (Houston), Member ✭✭
    edited November 2018

    Hi! Are there any specific recommended settings for dealing with fragmented FFPE samples with matched normals that are not from FFPE? I suppose it may be an option to increase the bin size, e.g. 10,000bp rather than 1000bp? Is it possible to set a non-zero bin size for exomes?

  • shlee (Cambridge), Member, Broadie ✭✭✭✭✭

    Hi @FPBarthel,

    I answered a similar question at https://gatkforums.broadinstitute.org/gatk/discussion/10263/mixed-pon-blood-ffpe. We do not have specific recommendations for dealing with FFPE samples in the CNA/CNV workflow. For exomes, you should use your exome targets as the intervals and pad them. Binning is for whole-genome samples, and 1000 bp is a good place to start for typical 30x-coverage WGS samples. This should be increased for lower coverage and could be reduced for higher-coverage samples.
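    One way to turn that guidance into a rule of thumb is to scale the bin length inversely with depth, keeping the expected reads per bin roughly constant (this is my own heuristic, not an official recommendation):

```python
def suggested_bin_length(mean_depth, reference_depth=30, reference_bin_bp=1000):
    # Scale the bin length so each bin collects about as many reads as a
    # 1000 bp bin does at 30x coverage.
    return int(round(reference_bin_bp * reference_depth / mean_depth))

print(suggested_bin_length(30))  # 1000
print(suggested_bin_length(15))  # 2000 -> larger bins for shallower data
print(suggested_bin_length(60))  # 500  -> smaller bins for deeper data
```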

  • FPBarthel (Houston), Member ✭✭

    Yeah, I figured that part. My experience so far is that CNV calls look a bit cleaner (even with 30-60x coverage) if you increase the bin size a bit, at the cost of decreased resolution; but for me confidence is more important, so I'm happy to make that sacrifice.

    Do you know whether increasing --bin-length has any effect when an intervals -L file (capture kit) is provided, @shlee? I would be interested in binning nearby (not directly adjacent) capture regions and treating them as a single bin/single "group" of regions when normalizing. Or would it instead make more sense to increase the --padding parameter?

  • shlee (Cambridge), Member, Broadie ✭✭✭✭✭

    @FPBarthel, the --padding parameter was designed for exactly what you want to do: increasing the size of the interval around a target capture region.

  • FPBarthel (Houston), Member ✭✭
    edited November 2018

    @shlee said:
    @FPBarthel, the --padding parameter was designed for exactly what you want to do: increasing the size of the interval around a target capture region.

    Right, I understand this, but what does the --bin-length parameter do when -L is provided? The guide and tool documentation suggest that they are mutually exclusive; however, I am able to run -L and --bin-length (non-zero) together.

  • shlee (Cambridge), Member, Broadie ✭✭✭✭✭

    Yes @FPBarthel, if you specify both, both come into play. I have, for example, used 1000bp bins and left the padding at its default. What happens is that the bins shift by the size of the padding.

  • emmalynchen, Member
    edited March 2019

    Hi @shlee, I'm currently trying to use FireCloud to run the cnv_somatic_panel_workflow.wdl
    and cnv_somatic_pair_workflow.wdl available here: https://github.com/gatk-workflows/gatk4-somatic-cnvs/blob/master/cnv_somatic_panel_workflow.wdl This looks like it might be the most up-to-date workflow (recent commit, and using CollectReadCounts), although all the snapshots in the public FireCloud method repo use CollectFragmentCounts?

    I've tried setting up the necessary workspace attributes and input/output connections according to cnv_somatic_panel_workflow.b37.inputs.json, but it's not clear to me what the data entities should look like or what root entity to specify when launching the analysis. I'm specifying the locations of the .bam/.bai files, but maybe I should be putting those in a data entity?

    Workspace attributes:

    cnv_somatic_panel_workflow.wdl input/output:

    Update: Didn't realize I could clone entire workspaces from the public repo - will give that a try.

  • shlee (Cambridge), Member, Broadie ✭✭✭✭✭

    Hi @emmalynchen,

    Only the earliest versions of the workflow use CollectFragmentCounts. You should be using versions of the pipeline that use CollectReadCounts.

    My colleague Beri informs me the latest Somatic CNV workspace on FireCloud at https://portal.firecloud.org/#workspaces/help-gatk/Somatic-CNVs-GATK4 is based on versioned WDL scripts from Dockstore and not the FireCloud Methods Repository. You can tell this from the Method Configurations tab.

    If you click on the link given by Path, this opens up the pipeline script in Dockstore at https://dockstore.org/workflows/github.com/gatk-workflows/gatk4-somatic-cnvs/cnv_somatic_panel_workflow:1.2.0?tab=info.

    You can tell which version of GATK is in use for the workspace by expanding the Workspace Attributes section. You will see the workspace currently uses GATK4.1.0.0. This uses CollectReadCounts as you can see in the CollectCounts task in the Common Tasks WDL.

    [...] it's not clear to me what the data entities should look like and what root entity to specify when launching the analysis. I'm specifying the location of the .bam/.bai files, but maybe I should be putting those in a data entity?

    If you click on the particular method of interest in the Method Configuration, the featured workspace will give helpful information towards data entities. Here's a part of the panel workflow:

    Here you see that the reference field calls on workspace attributes. Any attribute like this calls upon data within your data model. You can choose to add your samples to your data model to utilize this useful (and recommended) feature of FireCloud, or, alternatively, you can replace these with direct URLs to your cloud data, e.g. gs://.... The Type field will tell you what format the pipeline expects, e.g. File, Array[File], Array[String], etc.

    My colleague Allie is finalizing the step-by-step documentation on setting up data models. I think this will be helpful to you and the draft is at https://docs.google.com/document/d/158Ej2CMqoVYt-8P_TrZzVwnsialleCYzKSUPsoVElUw/edit?usp=sharing. She says:

    The sections they will want will likely be Linking data to your workspace with data model load tables (starting on page nine) and Advanced topic: Data on Terra starting on page 18.

    We hope this is helpful. Feel free to ask your question in the document as comments.

  • sahiilseth, Member

    @shlee thanks for this excellent post and discussion.

    What if there is a massive gender imbalance, e.g. 3 females vs. 17 males: is it better to create separate PoNs for males and females, or to drop the sex chromosomes from the interval_list and create a single PoN?

    In this case, I have heard a developer mention that one practice is to analyze the sexes separately, i.e. to create separate PoNs for the allosomal contigs. This way, you can analyze all of your autosomal contigs across the two sexes together.

    I have a follow-up question on this. So potentially, I could create separate count files for the autosomal and allosomal contigs for all samples:

    sample123_autosomal.hdf5 (for all samples)

    And then create two final PoNs, one for each sex (using CreateReadCountPanelOfNormals), right?

    Also, regarding your comment on manually curating CNV profiles when creating a PoN: what kind of metrics do you consider for curation?

    I only have germline (blood) samples, so not really "healthy normals". I am thinking of using the germline CNV calls to curate a set for the PoN, based on:

    • nice clean profiles overall
    • no evidence of germline CNVs (like was shown in the tutorial)
    • some samples have noisy depth ratios at telomeres: these can be a good control for the cases, but at the same time can increase the variance in the PoN. Comments really welcome here.
    • I see a few cases where I am not sure whether the change I am seeing is a technical artifact or a real signal.

    Here is an example of a blood normal:

  • sergey_ko13, Member

    Dear CN people of this thread,

    I usually spend weeks on a problem before asking, but if my question still seems too obvious or boring, feel free to ignore it )

    I face the challenge of searching for, downloading, aligning, deduplicating, and running CollectReadCounts on 40+ public normal WGS samples. I wonder if anyone has already done that job and generously shared the relatively small hdf5 files in some kind of repository, along with the intervals. Otherwise, it seems that dozens of us across the globe are querying the same big FASTQs, performing much the same operations on them, producing much the same hdf5s, and deleting them afterwards. This is even environmentally bad )

    I fully understand that every experimental design and platform needs its own PoN, but some are close enough to "borrow". We have 5 WGS case samples here, and downloading/processing 40 more would blow up my storage usage.

    If no such public files exist, I'd be happy to have any advice on how to better search for PoN members for my study. We have HiSeq 4000 2x150 30x WGS data from breast cancer cell lines. I first checked 1000 Genomes for female FASTQs from the CEU population; they all look too low-coverage, and I could not track down the sequencing platform used (the FASTQ headers tell me nothing; maybe the data is too old). Is there a better way to search, or a better source of sequencing data for us?

