We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

(How to part II) Sensitively detect copy ratio alterations and allelic segments

shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
edited May 2018 in Tutorials

Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.

image This workflow is broken into two tutorials. You are currently on the second part. See Tutorial#11682 for the first part.

For this second part, at the heart is segmentation, performed by ModelSegments. In segmentation, contiguous copy ratios are grouped together into segments. The tool performs segmentation for both copy ratios and for allelic copy ratios, given allelic counts. The segmentation is informed by both types of data, i.e. the tool uses allelic data to refine copy ratio segmentation and vice versa. The tutorial refers to this multi-data approach as joint segmentation. The presented commands showcase full features of tools. It is possible to perform segmentation for each data type independently, i.e. based solely on copy ratios or based solely on allelic counts.

The tutorial illustrates the workflow using a paired sample set. Specifically, detection of allelic copy ratios uses a matched control, i.e. the HCC1143 tumor sample is analyzed using a control, the HCC1143 blood normal. It is possible to run the workflow without a matched-control. See section 8.1 for considerations in interpreting allelic copy ratio results for different modes and for different purities.

The GATK4 CNV workflow offers a multitude of levers, e.g. towards fine-tuning analyses and towards controls. Researchers are expected to tune workflow parameters on samples with similar copy number profiles as their case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.

Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters

5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts

CollectAllelicCounts will tabulate counts of the reference allele and counts of the dominant alternate allele for each site in a given genomic intervals list. The tutorial performs this step for both the case sample, the HCC1143 tumor, and the matched-control, the HCC1143 blood normal. This allele-specific coverage collection is just that--raw coverage collection without any statistical inferences. In the next section, ModelSegments uses the allele counts towards estimating allelic copy ratios, which in turn the tool uses to refine segmentation.

Collect allele counts for the case and the matched-control alignments independently with the same intervals. For the matched-control analysis, the allelic count sites for the case and control must match exactly. Otherwise, ModelSegments, which takes the counts in the next step, will error. Here we use an intervals list that subsets gnomAD biallelic germline SNP sites to those within the padded, preprocessed exome target intervals [9].

The tutorial has already collected allele counts for full length sample BAMs. To demonstrate coverage collection, the following command uses the small BAMs originally made for Tutorial#11136 [6]. The tutorial does not use the resulting files in subsequent steps.

Collect counts at germline variant sites for the matched-control

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I normal.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_N_clean.allelicCounts.tsv

Collect counts at the same sites for the case sample

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I tumor.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_T_clean.allelicCounts.tsv

This results in counts table files. Each data file has header lines that start with an @ asperand symbol, e.g. @HD, @SQ and @RG lines, followed by a table of data with six columns. An example snippet is shown.

Comments on select parameters

  • The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should represent sites of common and/or sample-specific germline variant SNPs-only sites. Omit indel-type and mixed-variant-type sites.
  • The tool requires the reference genome, specified with -R, and aligned reads, specified with -I.
  • As is the case for most GATK tools, the engine filters reads upfront using a number of read filters. Of note for CollectAllelicCounts is the MappingQualityReadFilter. By default, the tool sets the filter's --minimum-base-quality to twenty. As a result, the tool will include reads with MAPQ20 and above in the analysis [10].

☞ 5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?

Another GATK tool, GetPileupSummaries, similarly counts reference and alternate alleles. The resulting summaries are meant for use with CalculateContamination in estimating cross-sample contamination. GetPileupSummaries limits counts collections to those sites with population allele frequencies set by the parameters --minimum-population-allele-frequency and --maximum-population-allele-frequency. Details are here.

CollectAllelicCounts employs fewer engine-level read filters than GetPileupSummaries. Of note, both tools use the MappingQualityReadFilter. However, each sets a different threshold with the filter. GetPileupSummaries uses a --minimum-mapping-quality threshold of 50. In contrast, CollectAllelicCounts sets the --minimum-mapping-quality parameter to 30. In addition, CollectAllelicCounts filters on base quality. The base quality threshold is set with the --minimum-base-quality parameter, whose default is 20.

back to top

6. Group contiguous copy ratios into segments with ModelSegments

ModelSegments groups together copy and allelic ratios that it determines are contiguous on the same segment. A Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from a GATK4.beta tool, PerformSegmentation, which GATK4 ModelSegments replaces. The older tool used a CBS (circular binary-segmentation) algorithm. ModelSegment's kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. A discussion of preliminary algorithm performance is here.

The algorithm performs segmentation for both copy ratios and for allelic copy ratios jointly when given both datatypes together. For allelic copy ratios, ModelSegments uses only those sites it determines are heterozygous, either in the control in a paired analysis or in the case in a case-only analysis [11]. In the paired analysis, the tool models allelic copy ratios in the case using sites for which the control is heterozygous. The workflow defines allelic copy ratios in terms of alternate-allele fraction, where total allele fractions for reference allele and alternate allele add to one for each site.

For the following command, be sure to specify an existing --output directory or . for the current directory.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean

This produces nine files each with the basename hcc1143_T_clean in the current directory and listed below. The param files contain global parameters for copy ratios (cr) and allele fractions (af), and the seg files contain data on the segments. For either type of data, the tool gives data before and after segmentation smoothing. The tool documentation details what each file contains. The last two files, labeled hets, contain the allelic counts for the control's heterogygous sites. Counts are for the matched control (normal) and the case.

  1. hcc1143_T_clean.modelBegin.seg
  2. hcc1143_T_clean.modelFinal.seg
  3. hcc1143_T_clean.cr.seg
  4. hcc1143_T_clean.modelBegin.af.param
  5. hcc1143_T_clean.modelBegin.cr.param
  6. hcc1143_T_clean.modelFinal.af.param
  7. hcc1143_T_clean.modelFinal.cr.param
  8. hcc1143_T_clean.hets.normal.tsv
  9. hcc1143_T_clean.hets.tsv

The tool has numerous adjustable parameters and these are described in the ModelSegments tool documentation. The tutorial uses the default values for all of the parameters. Adjusting parameters can change the resolution and smoothness of the segmentation results.

Comments on select parameters

  • The tool accepts both or either copy-ratios (--denoised-copy-ratios) or allelic-counts (--allelic-counts) data. The matched-control allelic counts (--normal-allelic-counts) is optional. If given both types of data, then copy ratios and allelic counts data together inform segmentation for both copy ratio and allelic segments. If given only one type of data, then segmentation is based solely on the given type of data.
  • The --minimum-total-allele-count is set to 30 by default. This means the tool only considers sites with 30 or more read depth coverage for allelic copy ratios.
  • The --genotyping-homozygous-log-ratio-threshold option is set to -10.0 by default. Increase this to increase the number of sites assumed to be heterozygous for modeling.
  • Default smoothing parameters are optimized for faster performance, given the size of whole genomes. The --maximum-number-of-smoothing-iterations option caps smoothing iterations to 25. MCMC model sampling is also set to 100, for both copy-ratio and allele-fraction sampling by the --number-of-samples-copy-ratio and --number-of-samples-allele-fraction options, respectively. Finally, --number-of-smoothing-iterations-per-fit is set to zero by default to disable model refitting between iterations. What this means is that the tool will generate only two MCMC fits--an initial and a final fit.

    • GATK4.beta's ACNV set this parameter such that each smoothing iteration refit using MCMC, at the cost of compute. For the tutorial data, which is targeted exomes, the default zero gives 398 segments after two smoothing iterations, while setting --number-of-smoothing-iterations-per-fit to one gives 311 segments after seven smoothing iterations. Section 8 plots these alternative results.
  • For advanced smoothing recommendations, see [12].

Section 8 shows the results of segmentation, the result from changing --number-of-smoothing-iterations-per-fit and the result of allelic segmentation modeled from allelic counts data alone. Section 8.1 details considerations depending on analysis approach and purity of samples. Section 8.2 shows the results of changing the advanced smoothing parameters given in [12].

ModelSegments runs in the following three stages.

  1. Genotypes heterozygous sites and filters on depth and for sites that overlap with copy-ratio intervals.
    • Allelic counts for sites in the control that are heterozygous are written to hets.normal.tsv. For the same sites in the case, allelic counts are written to hets.tsv.
    • If given only allelic counts data, ModelSegments does not apply intervals.
  2. Performs multidimensional kernel segmentation (1, 2).
    • Uses allelic counts within each copy-ratio interval for each contig.
    • Uses denoised copy ratios and heterozygous allelic counts.
  3. Performs Markov-Chain Monte Carlo (MCMC, 1, 2, 3) sampling and segment smoothing. In particular, the tool uses Gibbs sampling and slice sampling. These MCMC samplings inform smoothing, i.e. merging adjacent segments, and the tool can perform multiple iterations of sampling and smoothing [13].
    • Fits initial model. Writes initial segments to modelBegin.seg, posterior summaries for copy-ratio global parameters to modelBegin.cr.param and allele-fraction global parameters to modelBegin.af.param.
    • Iteratively performs segment smoothing and sampling. Fits allele-fraction model [14] until log likelihood converges. This process produces global parameters.
    • Samples final models. Writes final segments to modelFinal.seg, posterior summaries for copy-ratio global parameters to modelFinal.cr.param, posterior summaries for allele-fraction global parameters to modelFinal.af.param and final copy-ratio segments to cr.seg.

At the second stage, the tutorial data generates the following message.

INFO  MultidimensionalKernelSegmenter - Found 638 segments in 23 chromosomes.

At the third stage, the tutorial data generates the following message.

INFO  MultidimensionalModeller - Final number of segments after smoothing: 398

For tutorial data, the initial number of segments before smoothing is 638 segments over 23 contigs. After smoothing with default parameters, the number of segments is 398 segments.

back to top

7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments

CallCopyRatioSegments allows for systematic calling of copy-neutral, amplified and deleted segments. This step is not required for plotting segmentation results. Provide the tool with the cr.seg segmentation result from ModelSegments.

gatk CallCopyRatioSegments \
    --input hcc1143_T_clean.cr.seg \
    --output sandbox/hcc1143_T_clean.called.seg

The resulting called.seg data adds the sixth column to the provided copy ratio segmentation table. The tool denotes amplifications with a + plus sign, deletions with a - minus sign and neutral segments with a 0 zero.

Here is a snippet of the results.

Comments on select parameters

  • The parameters --neutral-segment-copy-ratio-lower-bound (default 0.9) and --neutral-segment-copy-ratio-upper-bound (default 1.1) together set the copy ratio range for copy-neutral segments. These two parameters replace the GATK4.beta workflow’s --neutral-segment-copy-ratio-threshold option.

back to top

8. Plot modeled copy ratio and allelic fraction segments with PlotModeledSegments

PlotModeledSegments visualizes copy and allelic ratio segmentation results.

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.hets.tsv \
    --segments hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces plots in the plots folder. The plots represent final modeled segments for both copy ratios and alternate allele fractions. If we are curious about the extent of smoothing provided by MCMC, then we can similarly plot initial kernel segmentation results by substituting in --segments hcc1143_T_clean.modelBegin.seg.

Comments on select parameters

  • The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping [4].
  • To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

As of this writing, it is NOT possible to subset plotting with genomic intervals, i.e. with the -L parameter. To interactively visualize data, consider the following options.

  • Modify the sequence dictionary to contain only the contigs of interest, in the order desired.
  • The bedGraph format for targeted exomes and bigWig for whole genomes. An example of CNV data converted to bedGraph and visualized in IGV is given in this discussion.
  • Alternatively, researchers versed in R may choose to visualize subsets of data using RStudio.

Below are three sets of results for the HCC1143 tumor cell line in order of increasing smoothing. The top plot of each set shows the copy ratio segments. The bottom plot of each set shows the allele fraction segments.

  • In the denoised copy ratio segment plot, individual targets still display as points on the plot. Different copy ratio segments are indicated by alternating blue and orange color groups. The denoised median is drawn in thick black.
  • In the allele fraction plot, the boxes surrounding the alternate allelic fractions do NOT indicate standard deviation nor standard error, which biomedical researchers may be more familiar with. Rather, the allelic fraction data is given in credible intervals. The allelic copy ratio plot shows the 10th, 50th and 90th percentiles. These should be interpreted with care as explained in section 8.1. Individual allele fraction data display as faint data points, also in orange and blue.

8A. Initial segmentation before MCMC smoothing gives 638 segments.

8B. Default smoothing gives 398 segments.

8C. Enabling additional smoothing iterations per fit gives 311 segments. See section 6 for a description of the --number-of-smoothing-iterations-per-fit parameter.

Smoothing accounts for data points that are outliers. Some of these outliers could be artifactual and therefore not of interest, while others could be true copy number variation that would then be missed. To understand the impact of joint copy ratio and allelic counts segmentation, compare the results of 8B to the single-data segmentation results below. Each plot below shows the results of modeling segmentation on a single type of data, either copy-ratios or allelic counts, using default smoothing parameters.

8D. Copy ratio segmentation based on copy ratios alone gives 235 segments.

8E. Allelic segmentation result based on allelic counts alone in the matched case gives 105 segments.

Compare chr1 and chr2 segmentation for the various plots. In particular, pay attention to the p arm (left side) of chr1 and q arm (right side) of chr2. What do you think is happening when adjacent segments are slightly shifted from each other in some sets but then seemingly at the same copy ratio for other sets?

For allelic counts, ModelSegments retains 16,872 sites that are heterozygous in the control. Of these, the case presents 15,486 usable sites. In allelic segmentation using allelic counts alone, the tool uses all of the usable sites. In the matched-control scenario, ModelSegments emits the following message.

INFO  MultidimensionalKernelSegmenter - Using first allelic-count site in each copy-ratio interval (12668 / 15486) for multidimensional segmentation...

The message informs us that for the matched-control scenario, ModelSegments uses the first allele-count site for each genomic interval towards allelic modeling. For the tutorial data, this is 12,668 out of the 15,486 or 81.8% of the usable allele-count sites. The exclusion of ~20% of allelic-counts sites, together with the lack of copy ratio data informing segmentation, account for the difference we observe in this and the previous allelic segmentation plot.

In the allele fraction plot, some of the alternate-allele fractions are around 0.35/0.65 and some are at 0/1. We also see alternate-allele fractions around 0.25/0.75 and 0.5. These suggest ploidies that are multiples of one, two, three and four.

Is it possible a copy ratio of one is not diploid but represents some other ploidy?

For the plots above, focus on chr4, chr5 and chr17. Based on both the copy ratio and allelic results, what is the zygosity of each of the chromosomes? What proportion of each chromosome could be described as having undergone copy-neutral loss of heterozygosity?

☞ 8.1 Some considerations in interpreting allelic copy ratios

For allelic copy ratio analysis, the matched-control is a sample from the same individual as the case sample. In the somatic case, the matched-control is the germline normal sample and the case is the tumor sample from the same individual.

The matched-control case presents the following considerations.

  • If a matched control contains any region with copy number amplification, the skewed allele fractions still allow correct interpretation of the original heterozygosity.
  • However, if a matched control contains deleted regions or regions with copy-neutral loss of heterozygosity or a long stretch of homozygosity, e.g. as occurs in uniparental disomy, then these regions would go dark so to speak in that they become apparently homozygous and so ModelSegments drops them from consideration.
  • From population sequencing projects, we know the expected heterozygosity of normal germline samples averages around one in a thousand. However, the GATK4 CNV workflow does not account for any heterozygosity expectations. An example of such an analysis that utilizes SNP array data is HAPSEG. It is available on GenePattern.
  • If a matched normal contains tumor contamination, this should still allow for the normal to serve as a control. The expectation is that somatic mutations coinciding with common germline SNP sites will be rare and ModelSegments (i) only counts the dominant alt allele at multiallelic sites and (ii) recognizes and handles outliers. To estimate tumor in normal (TiN) contamination, see the Broad CGA group's deTiN.

Here are some considerations for detecting loss of heterozygosity regions.

  • In the matched-control case, if the case sample is pure, i.e. not contaminated with the control sample, then we see loss of heterozygosity (LOH) segments near alternate-allele fractions of zero and one.
  • If the case is contaminated with matched control, whether the analysis is matched or not, then the range of alternate-allele fractions becomes squished so to speak in that the contaminating normal's heterozygous sites add to the allele fractions. In this case, putative LOH segments still appear at the top and bottom edges of the allelic plot, at the lowest and highest alternate-allele fractions. For a given depth of coverage, the fraction of reads that differentiates zygosity is narrower in range and therefore harder to differentiate visually.

    8F. Case-only analysis of tumor contaminated with normal still allows for LOH detection. Here, we bluntly added together the tutorial tumor and normal sample reads. Results for the matched-control analysis are similar.

  • In the tumor-only case, if the tumor is pure, because ModelSegments drops homozygous sites from consideration and only models sites it determines are heterozygous, the workflow cannot ascertain LOH segments. Such LOH regions may present as an absence of allelic data or as low confidence segments, i.e. having a wide confidence interval on the allelic plot. Compare such a result below to that of the matched case in 8E above.

    8G. Allelic segmentation result based on allelic counts alone for case-only, when the case is pure, can produce regions of missing representation and low confidence allelic fraction segments.

    Compare results. Focus on chr4, chr5 and chr17. While the matched-case gives homozygous zygosity for each of these chromosomes, the case-only allelic segmentation either presents an absence of segments for regions or gives low confidence allelic fraction segments at alternate allele fractions that are inaccurate, i.e. do not represent actual zygosity. This is particularly true for tumor samples where aneuploidy and LOH are common. Interpret case-only allelic results with caution.

Finally, remember the tutorial analyses above utilize allelic counts from gnomAD sites of common population variation that have been lifted-over from GRCh37 to GRCh38. For allelic count sites, use of sample-specific germline variant sites may incrementally increase resolution. Also, use of confident variant sites from a callset derived from alignments to the target reference may help decrease noise. Confident germline variant sites can be derived with HaplotypeCaller calling on alignments and subsequent variant filtration. Alternatively, it is possible to fine-tune ModelSegments smoothing parameters to dampen noise.

☞ 8.2 Some results of fine-tuning smoothing parameters

This section shows plotting results of changing some advanced smoothing parameters. The parameters and their defaults are given below, in the order of recommended consideration [12].

--number-of-changepoints-penalty-factor 1.0 \
--kernel-variance-allele-fraction 0.025 \
--kernel-variance-copy-ratio 0.0 \
--kernel-scaling-allele-fraction 1.0 \
--smoothing-credible-interval-threshold-allele-fraction 2.0 \
--smoothing-credible-interval-threshold-copy-ratio 2.0 \

The first four parameters impact segmentation while the last two parameters impact modeling. The following plots show the results of changing these smoothing parameters. The tutorial chose argument values arbitrarily, for illustration purposes. Results should be compared to that of 8B, which gives 398 segments.

8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.


8I. Increasing kernel variance parameters each to 0.8 gives 144 segments. Changing --kernel-variance-copy-ratio alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.


8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments. Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.


8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.


back to top


[9] The GATK Resource Bundle provides two variations of a SNPs-only gnomAD project resource VCF. Both VCFs are sites-only eight-column VCFs but one retains the AC allele count and AF allele frequency variant-allele-specific annotations, while the other removes these to reduce file size.

  • For targeted exomes, it may be convenient to subset these to the preprocessed intervals, e.g. with SelectVariants for use with CollectAllelicCounts. This is not necessary, however, as ModelSegments drops sites outside the target regions from its analysis in the joint-analysis approach.
  • For whole genomes, depending on the desired resolution of the analysis, consider subsetting the gnomAD sites to those commonly variant, e.g. above an allele frequency threshold. Note that SelectVariants, as of this writing, can filter on AF allele frequency only for biallelic sites. Non-biallelic sites make up ~3% of the gnomAD SNPs-only resource.
  • For more resolution, consider adding sample-specific germline variant biallelic SNPs-only sites to the intervals. Section 8.1 shows allelic segmentation results for such an analysis.

[10] The MAPQ20 threshold of CollectAllelicCounts is lower than that used by CollectFragmentCounts, which uses MAPQ30.

[11] In particular, the tool considers only heterozygous sites that have counts for both the reference allele and the alternate allele. If multiple alternate alleles are present, the tool uses the alternate allele with the highest count and ignores any other alternate allele(s).

[12] These advanced smoothing recommendations are from one of the workflow developers--@slee.

  • For smoother results, first increase --number-of-changepoints-penalty-factor from its default of 1.0.
  • If the above does not suffice, then consider changing the kernel-variance parameters --kernel-variance-copy-ratio (default 0.0) and --kernel-variance-allele-fraction (default 0.025), or change the weighting of the allele-fraction data by changing --kernel-scaling-allele-fraction (default 1.0).
  • If such changes are still insufficient, then consider adjusting the smoothing-credible-interval-threshold parameters --smoothing-credible-interval-threshold-copy-ratio (default 2.0) and --smoothing-credible-interval-threshold-allele-fraction (default 2.0). Increasing these will more aggressively merge adjacent segments.

[13] In particular, uses Gibbs sampling, a type of MCMC sampling, towards both allele-fraction modeling and copy-ratio modeling, and additionally uses slice sampling towards allele-fraction modeling. @slee details the following substeps.

  1. Perform MCMC (Gibbs) to fit the copy-ratio model posteriors.
  2. Use optimization (of the log likelihood) to initialize the Markov Chain for the allele-fraction model.
  3. Perform MCMC (Gibbs and slice) to fit the allele-fraction model posteriors.
  4. The initial model is now fit. We write the corresponding modelBegin files, including those for global parameters.
  5. Iteratively perform segment smoothing.
  6. Perform steps 1-4 again, this time to generate the final model fit and modelFinal files.

[14] @slee shares the tool initializes the MCMC by starting off at the maximum a posteriori (MAP) point in parameter space.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top

Post edited by shlee on


  • FPBarthelFPBarthel HoustonMember ✭✭

    Can we use the same germline VCF file for CollectAllelicCounts as used for GetPileupSummaries, eg. small_exac_common_3_b37.vcf.gz?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @FPBarthel,

    I have gone down this path and was told the small_exac_common_3_b37.vcf.gz resource is too sparse for the purposes of CollectAllelicCounts. It may be sufficient for preliminary analyses but we hope you will consider using more sites towards modeling segments.

  • FPBarthelFPBarthel HoustonMember ✭✭

    Thanks @shlee ! Any specific recommendations here? Is af-only-gnomad.raw.sites.b37.vcf.gz an option, or likely too large? Any recommendations for using SelectVariants to filter it down smaller? I checked the b37 CNV WDL file on github and found that there are references to a file gs://gatk-test-data/cnv/somatic/common_snps.interval_list, is this file available anywhere?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    We don't have any specific recommendations @FPBarthel. The workflow is still in BETA as you know. Please refer to SelectVariants documentation and this WDL script for example commands. You should be able to access the gatk-test-data Google Storage bucket as it is public. You can either use gsutil (a tool available in the Google Cloud SDK software program) or https://console.cloud.google.com/storage/browser/gatk-test-data/cnv/somatic. Notice the structure of the URL address and how it corresponds to the gs:// address.

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited June 2018

    Thank you @shlee very much for sharing that link, that is very useful. Meanwhile I construct a working file of my own using the following commands and the af-only-gnomad.raw.sites.b37.norm.vcf.gz and human_g1k_v37_decoy.fasta input files from the bundle, if this is of interest to anyone:

    ## Need to split multi-allelic sites across multiple lines, see
    ## https://gatkforums.broadinstitute.org/gatk/discussion/10975/use-select-variants-on-a-gnomad-vcf-for-mutect2-contamination-filtering
    bcftools norm \
        -f "human_g1k_v37_decoy.fasta" \
        -m- \
        -o "af-only-gnomad.raw.sites.b37.norm.vcf.gz" \
        -O z \
        --threads 6 \
    ## Had to remove contigs from VCF file because bcftools does not add length attribute to contig
    ## and SelectVariants does not proceed if lenght is missing (but works fine if contigs are 
    ## omitted entirely, in which case they are reconstructed from fasta)
    bcftools view \
        -h "af-only-gnomad.raw.sites.b37.norm.vcf.gz" | \
        sed '/^##contig/d' \
        > "newheader.txt"
    bcftools reheader \
        -h "newheader.txt" \
        -o "af-only-gnomad.raw.sites.b37.norm.reheader.vcf.gz" \
    ## Need to have an index because SelectVariants requires it
    gatk IndexFeatureFile \
        -F "af-only-gnomad.raw.sites.b37.norm.reheader.vcf.gz"
    gatk SelectVariants \
        -V  "af-only-gnomad.raw.sites.b37.norm.reheader.vcf.gz" \
        -O "af-only-gnomad.raw.sites.b37.selected.vcf.gz" \
        -R "human_g1k_v37_decoy.fasta" \
        --select "AF>0.05"
  • wisekhwisekh Member
    edited August 2018

    Please, discard my question. I found a type in my code.

  • wisekhwisekh Member

    Hi @shlee,

    Which files have allelic segments? I am looking at the output files from CallCopyRatioSegments and ModelSegments step, but I don't see any file having allelic segments. I only see *allelic ratio segments** and copy ratio segments.

    Thank you in advance,

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @wisekh,

    If your analysis called both copy ratio alterations and allelic segments at the same time, then the allelic segments should be the same as the copy ratio segments.

  • FPBarthelFPBarthel HoustonMember ✭✭

    I was just going over the most recent somatic CNV WDL (updated 11 days ago) and it looks like there has been quite a large change compared to what is described in this tutorial. It seems that ModelSegments and CallCopyRatioSegments should now be ran separately for tumor and normal in contrary to to this tutorial, which (I believe?) suggests just running it for tumor, with the normal supplied using --normal-allelic-counts to ModelSegments? Am I understanding this correctly?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @FPBarthel,

    Thanks for updating the community about the recent changes to this workflow's WDL pipeline. I'm not familiar with the changes and this is why the tutorials are pinned to a specific version of GATK and we provide matched WDL workflows in the gatk repo for tools undergoing active development.

  • FPBarthelFPBarthel HoustonMember ✭✭

    Thanks for clarifying! A question that occurred to me: is it possible to estimate tumor purity (and ploidy) from the naive GATK4 model segments output?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @FPBarthel,

    is it possible to estimate tumor purity (and ploidy) from the naive GATK4 model segments output?

    No, this is not possible. For this, you could run the SEG files through external tools such as ABSOLUTE.

  • Dear @shlee ,

    I have a short question on a content of public folder https://console.cloud.google.com/storage/browser/gatk-test-data/cnv/somatic you have mentioned some comments above.

    Could you please point us to the tutorials or projects that use that files?

    in particular: PoN_4.0_WGS_for_public.25k_bp.pon.hdf5 looks like WGS PoN with 25kb windiw size. Why 25kb is used instead of 1kb (GATKwr23-S2-Somatic_CNVs.pdf presentation)? Actually, I was more surprized by 1000bp window size in presentation - as most of CNV tools use 10-50kb window. Just wonder if I am doing it right.

    second: could we use such files from other projects as CNV PoN in absense of normals in our cohort? empirically - It helps a lot to denoise our signal - I tried. but is it fair?

    Thank You!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @sergey_ko13,

    The tutorials above offer example data through the FTP site and a GoogleDrive folder linked to in the Download example data section of the first part, in Tutorial#11682. I'm not familiar with the data in the gcloud storage bucket: https://console.cloud.google.com/storage/browser/gatk-test-data/cnv/somatic. It looks like the pipelines offered in https://github.com/gatk-workflows utilize this data, specifically:

    However, I do not see any utilization of PoN_4.0_WGS_for_public.25k_bp.pon.hdf5. It may be a data file that results from one of the WDL pipelines. I think @bshifaw can clarify.

    In general bin size should be tuned to sequencing depth. For typical 30x WGS, our workflows recommend 1000bp bin size as a good place to start. If you have less depth, then increase the size of the bin. If you have more depth, you could decrease the size of the bin. Yes, the GATK CNV workflows are improvements over past tools and are meant to provide additional resolution.

    second: could we use such files from other projects as CNV PoN in absense of normals in our cohort? empirically - It helps a lot to denoise our signal - I tried. but is it fair?

    The workflow is meant to allow for denoising of unmatched samples via the panel of normals. So yes, you should be able to use normal samples from other projects towards creating a PoN. The key thing to remember is that matching for batch effects is critical and you should probably match for ethnicity composition.

  • Great! Thank You!

  • liummliumm chinaMember

    Hi, I run GATK4 to call somatic CNV, command as below: gatk ModelSegments --denoised-copy-ratios B1701.markDup.sorted.recal.bam.denoisedCR.tsv --output cnv_B1701 --output-prefix B1701.ModelSegments

    in denoisedCR.tsv, there's this line as below:
    8 128748585 128753459 1.105242

    why is this not called as CNV, cause actually it's should be a positive call.
    can you tell me if there's some options to set to get this region called as cnv

  • lzhan140lzhan140 Member

    Hi, I'm running CollectAllelicCounts for my whole genome sequencing samples, but it is very slow. The starting speed is about 6,700,000 Loci/Minute but decreased to 267,000 after 20 hours and it was still on chr2 after that long time. How should I speed up or change to get better performance on this step?

    I'm running GATK, 4 cores, 36g RAM. I also used your -L wgs_calling_regions.hg38.interval_list. My java options are -Xms30g -Xmx34g.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited November 2018

    Hi @liumm,

    The data you attached shows zero values for every interval including for

    8 128748585 128753459 1.105242

    Just letting you know in case you accidentally swapped your files and that is why you are not getting CNV calls during segmentation.

    The denoisedCR.tsv gives the LOG2_COPY_RATIO for each interval and is a step that comes after standardization but before segmentation. Currently, the workflow requires performing segmentation to actually call CNV events. There is also the tool, CallCopyRatioSegments, which was still under some development last I checked. It's my understanding the decision to call a copy number event is based not solely on a given locus but the decision incorporates information from bins/intervals surrounding the locus. This smoothing allows for regions that by some artifact present as aberrant but you want to ignore. If you think such a site is not an artifact but in fact a CNV event, then it is possible to fine-tune ModelSegments parameters to lessen smoothing (increase sensitivity) such that the event is called. I go over a few ModelSegments parameters to adjust smoothing in Section 8.2 of Tutorial#11683. I suggest starting with these and referencing the tool documentation for details on what the parameters actually do and for other adjustable parameters. Please let us know how it goes and if you have more technical questions for our developers.

    I just want to reiterate that the ModelSegments CNV workflow is meant to be fine-tuned on a case-by-case basis.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @lzhan140,

    For CollectAllelicCounts, you should provide the tool SNPs-only sites. The tutorial presents this resource as -L chr17_theta_snps.interval_list. The wgs_calling_regions.hg38.interval_list file you are using looks familiar--does it contain large intervals? It is important to limit collecting allelic counts to common or sample-matched SNPs-only sites. You can provide a separate SNPs file in addition to your calling regions. One trick I have found (after writing this tutorial) that speeds up the performance is to provide the SNP sites in a sites-only VCF instead of an interval_list. I have prepared a use-at-your-own-risk liftedover resource for GRCh38 (based on GRCh37 gnomAD cleaned up by our Mutect2 developer for Mutect2 use) and you can find this in the GATK Resource Bundle, on the FTP site, under the folder beta/CNV. I made two different versions that differ by the inclusion of the AF population allele frequency, in case researchers desire to filter further on AF. We recommend you subset the records to those with your coverage collection intervals. I hope this helps.

  • DS89DS89 UKMember

    I am trying to run CollectAllelicCounts on a 200GB tumor BAM file from WGS. The program ran for 300 mins and exited "java.lang.OutOfMemoryError: Java heap space". The memory I have provided is Xmx57000m. I am using the dbSNP as common sites "-L" here. This same program runs fine for a low coverage WES.

    Could you help figure out why CollectAllelicCounts program is using a lot of memory even before it starts processing each interval?


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited November 2018

    Hi @DS89,

    How much RAM does your system have? The Xmx57000m, or 57GiB limit you set applies only if your system has more than this amount of memory. For a linux system, you can check with free -m:

    The dbSNP resource you are using, did you subset down to SNPs-only sites? That is, remove indel and mixed-type sites? You might consider preprocessing the resource you use to collect allelic counts data.

  • lzhan140lzhan140 Member

    Hi @shlee,

    Yes, you are right. I changed to a SNP site interval file instead of the large genomic interval file. It was done in 10 mins with 5m SNP sites. Thanks a lot.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Great to hear @lzhan140. Thank you for letting us know the solution worked.

  • DavidNixDavidNix Member

    Hello Folks,

    I build up a dockerized snakemake workflow following your two excellent tutorials: https://github.com/HuntsmanCancerInstitute/Workflows/tree/master/Hg38RunnerWorkflows/CNV

    This generates several bigwig formatted graph files for AF and CR for both tumor and normal, as well as a summary xls spreadsheet.

    One key issue I see is the presence of many copy ratio calls that are also present in the matched normal. I'm using a set of 25 normals for the denoising step but these don't include the matched normal. Thus I suspect it is causing the false positives. I've built in flags for this situation into the spreadsheet output. But ideally, your tools could be adapted to incorporate the matched normal during the calling step and remove these.

    Thanks for the excellent tool kit.


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @DavidNix,

    Thank you for sharing your dockerized snakemake pipeline. I have heard of snakemake--how do you like it? We also provide a pipelining solution with a language called WDL that is run by software called Cromwell. You can read more about these at https://software.broadinstitute.org/wdl/documentation/. As our workflows mature, each version of the toolkit provides up-to-date WDL scripts for the pipelines, including the CNV workflow. You can find these at https://github.com/broadinstitute/gatk/tree/master/scripts.

    Our developers are aware that the ModelSegments CNV workflow may output rare germline CNVs that are unaccounted for by the PoN samples.

    One solution GATK provides is the Germline CNV (gCNV) workflow. You can run your matched normal through the gCNV workflow to call both rare and common germline CNVs. I think you could then use these calls to selectively filter germline events from your ModelSegments CNV results.

    Soo Hee

  • DavidNixDavidNix Member

    Thanks Soo Hee for the pointer for WDL. We've been using snakemake for years, before these other systems were available, it is excellent for what it does. I wouldn't say these normal calls are rare, I see about 2-5 in each somatic exome set we run. They also are likely false positives due to artifacts in the particular sample set not present in the PoN.

    -cheers, David

  • lishiyonglishiyong Member

    I am use ModelSegments with paired WGS samples with gatk- The shell is: "~/bin/gatk- --java-options "-Xmx4g" ModelSegments --denoised-copy-ratios 18060701BP.10K.denoisedCR.tsv --allelic-counts 18060701T.allelicCounts.tsv --normal-allelic-counts 18060701W.allelicCounts.tsv --output ./ModelSegments --output-prefix 18060701T"

    The error information:
    Using GATK jar ~/bin/gatk-
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -jar ~/bin/gatk- ModelSegments --denoised-copy-ratios 18060701T.10K.denoisedCR.tsv --allelic-counts 18060701T.allelicCounts.tsv --normal-allelic-counts 18060701W.allelicCounts.tsv --output ./ --output-prefix 18060701T
    11:44:01.217 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:~/bin/gatk-!/com/intel/gkl/native/libgkl_compression.so
    11:44:03.099 INFO ModelSegments - ------------------------------------------------------------
    11:44:03.100 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.1.0.0
    11:44:03.100 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
    11:44:03.100 INFO ModelSegments - Executing as [email protected] on Linux v4.14.0-1.el7.elrepo.x86_64 amd64
    11:44:03.100 INFO ModelSegments - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_25-b17
    11:44:03.101 INFO ModelSegments - Start Date/Time: March 7, 2019 11:44:01 AM CST
    11:44:03.101 INFO ModelSegments - ------------------------------------------------------------
    11:44:03.101 INFO ModelSegments - ------------------------------------------------------------
    11:44:03.102 INFO ModelSegments - HTSJDK Version: 2.18.2
    11:44:03.102 INFO ModelSegments - Picard Version: 2.18.25
    11:44:03.102 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    11:44:03.102 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    11:44:03.102 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    11:44:03.103 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    11:44:03.103 INFO ModelSegments - Deflater: IntelDeflater
    11:44:03.103 INFO ModelSegments - Inflater: IntelInflater
    11:44:03.103 INFO ModelSegments - GCS max retries/reopens: 20
    11:44:03.103 INFO ModelSegments - Requester pays: disabled
    11:44:03.103 INFO ModelSegments - Initializing engine
    11:44:03.103 INFO ModelSegments - Done initializing engine
    11:44:03.107 INFO ModelSegments - Reading file (18060701T.10K.denoisedCR.tsv)...
    11:44:03.706 INFO ModelSegments - Reading file (18060701T.allelicCounts.tsv)...
    11:44:14.885 INFO ModelSegments - Reading file (18060701W.allelicCounts.tsv)...
    11:44:28.600 INFO ModelSegments - Shutting down engine
    [March 7, 2019 11:44:28 AM CST] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 0.46 minutes.
    java.lang.IllegalArgumentException: Metadata do not match for input case-sample files.
    at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724)
    at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.getValidatedMetadata(ModelSegments.java:609)
    at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:483)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @lishiyong,

    For the error message:

    Metadata do not match for input case-sample files.

    It appears there is a mismatch between your denoised copy ratios and allelic counts data. Can you double-check that the data are matched, e.g. use the exact same intervals?

  • lishiyonglishiyong Member

    Hi @shlee
    I use the same intervals.
    Please see the whole pipeline:
    ~/bin/gatk- CollectReadCounts -I 18060701T.sorted.dup.bqsr.bam -L hg19_1M.preprocessed.autosome.interval_list --interval-merging-rule OVERLAPPING_ONLY -O 18060701T.counts.hdf5

    ~/bin/gatk- --java-options "-Xmx12g" DenoiseReadCounts -I 18060701T.counts.hdf5 --count-panel-of-normals cnvPON.10K.hdf5 --standardized-copy-ratios 18060701T.10K.standardizedCR.tsv --denoised-copy-ratios 18060701T.10K.denoisedCR.tsv

    ~/bin/gatk- PlotDenoisedCopyRatios --standardized-copy-ratios 18060701T.10K.standardizedCR.tsv --denoised-copy-ratios 18060701T.10K.denoisedCR.tsv --sequence-dictionary /thinker/storage/world/gene/database/hg19/hg19/hg19.dict --minimum-contig-length 50000000 --output CopyRatio/10K/ --output-prefix 18060701T.10K

    ~/bin/gatk- SelectVariants -R /thinker/storage/world/gene/database/hg19/hg19/hg19.fa -V af-only-gnomad.b37.chr.vcf -L hg19_1M.preprocessed.autosome.interval_list --select-type-to-include SNP -O af-only-gnomad.select.b37.chr.vcf

    ~/bin/gatk- --java-options "-Xmx12g" CollectAllelicCounts -L af-only-gnomad.select.b37.chr.vcf -I 18060701T.sorted.dup.bqsr.bam -R /thinker/storage/world/gene/database/hg19/hg19/hg19.fa -O 18060701T.allelicCounts.tsv

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @lishiyong,

    Thanks for your pipeline. I see that you have a step to collect allelic counts for your tumor. Can you confirm you use the same intervals when collecting allelic counts for your normal, 18060701W.allelicCounts.tsv?

    Did you know GATK offers pipelined solutions for researchers? For the ModelSegments workflow, you can find these at https://github.com/gatk-workflows/gatk4-somatic-cnvs. Development versions are at https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl.

  • sleeslee Member, Broadie, Dev ✭✭✭
    edited March 2019

    Hi @lishiyong,

    That error message suggests that either 1) the sample name, and/or 2) the sequence dictionary does not match in the header of your 18060701T.10K.denoisedCR.tsv and 18060701T.allelicCounts.tsv files.
    Can you confirm?

    If the sequence dictionary does not match, I'm not sure where this inconsistency was introduced in your toolchain, but you will probably want to check that your reference, original/preprocessed interval lists, and common-sites VCF all have matching dictionaries.


  • sleeslee Member, Broadie, Dev ✭✭✭

    Hi @DavidNix,

    Apologies, just now seeing your comment regarding the presence of germline events from the matched normal in the tumor segments. We are indeed aware of this issue and plan to address it with downstream tools that either implement a filtering step or call germline/somatic CNVs simultaneously.

    For now, users may want to implement their own filtering step using the ModelSegments output from both the matched normal and tumor, or even the gCNV calls for the matched normal (it sounds like you may have made some progress in this direction). Alternatively, one could adjust the segmentation parameters for ModelSegments to be less sensitive to smaller germline CNVs.


  • CritEinCritEin ChinaMember

    Hi @shlee,
    I use GATK4 to call somatic CNV ,the case-only without a matched-control, so I don't do the section 5 and omit allelic-counts parameters from the example commands in sections 6 and 8 as the tutorials say.

    When I have done the section 7 , I get the resulting called.seg data.These are copy ratios, not copy numbers, so with the copy ratio and allele fraction estimates we can use other tools to predict absolute copy number.

    The question is that how can I get the absolute copy numbers or which tools can ues the copy ratio to predict the absolute copy numbers?


Sign In or Register to comment.