Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

After gCNV calling considerations

shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

The Germline CNV (gCNV) workflow tutorial is available at https://gatkforums.broadinstitute.org/gatk/discussion/11684. The tutorial lays out the steps of the workflow to obtain per-sample VCF calls. This post discusses some considerations after gCNV calling.


Note the workflow came out of beta and is in production status with the v4.1.0.0 release. Towards this, GATK developers focused on solving the hard problems they excel at for sensitive CNV detection. For example, the ability to call the copy number states of multi-allelic CNVs, e.g. for the amylase locus, alongside rare variants is an algorithmic advancement.

Here are five directions to take with gCNV calls that we illustrate with Tutorial#11684's data.

  1. Visualize gCNV results in IGV and compare to truth set calls
  2. Towards ascertaining the quality of gCNV calls, use Jupyter Notebooks
  3. Consider ModelSegments CNV for rare gCNV detection
  4. Go back and call short variants for gCNV event regions
  5. Explore functional implications with GTEx

1. Visualize gCNV results in IGV and compare to truth set calls

The 1000 Genomes Project Phase 3 SV callset was a landmark effort, and you can read about it in the article titled An integrated map of structural variation in 2,504 human genomes (2015; doi.org/10.1038/nature15394). The tutorial's 24 samples are a part of the SV cohort and so it is possible to use calls subset from the corresponding GRCh38 callset as truth for the tutorial results. Some other SV resources are:

  • A resource that @cwhelan shares is NCBI's dbvar (database of variation) at https://www.ncbi.nlm.nih.gov/dbvar/. It is a database of human genomic structural variation. Data are available for human assembly versions GRCh37 and GRCh38.

  • As of 2019/3/14, gnomAD provides population SV calls, which include CNVs. Data is on the GRCh37/hg19 human reference assembly. The biorxiv paper is here. It is possible to browse population SV calls in the gnomAD browser. Toggle the button to display SV data instead of short variant data.


Towards concordance measurements, the first step is to perform some spot-checking with visualization. The gCNV workflow generates VCF callsets. Although IGV (Integrative Genomics Viewer) visualizes VCF calls just fine, it is useful to heatmap color copy number states using SEG format data (.seg or .seg.gz). For large data, e.g. WGS genotyped intervals, for fast access, the indexed binary BigWig format is best. Here we visualize results from the gCNV tutorial against a truth set.


The twelve regions show regions of truth set events larger than 1Kbp. Tracks are as follows.

  1. NA19017 BAM coverage histogram
  2. 1000 Genomes Project integrated truth set CN calls for NA19017, which happen to only contain deletion calls
  3. Case-mode NA19017 segments using default parameters and small regions
  4. Cohort-mode NA19017 segments using default parameters and small regions
  5. Cohort-mode NA19017 segments using sensitive parameters (Tutorial#11684 section 4.1) and small regions
  6. RefSeq Genes
  7. Regions where all 24 samples in the cohort were diploid from a different analysis. Breaks indicate regions where at least one sample gives a copy number event.

Here, SEG track deletions range from light blue to blue (CN1-CN0) and amplifications from salmon to red (CN3-CN5). Normal CN2 ploidy is in white. The case and cohort mode results are highly concordant. Against the truth set, by eye, it appears they are missing three of the calls in the regions. Turning the sensitivity knob up gives better concordance with truth set, as gCNV then additionally calls two of the three missing calls. Both the default and sensitive parameter runs pickup additional events absent from the truth set.


If we focus on the first region, we see the default and sensitive parameter runs give differently sized events such that the functional impact on SIRPB1 expression is likely different. The 1000 Genomes Project truth set is an integrated callset. The truth set also gives two differently sized deletion events for the locus. Each of these calls originates from a different caller; the larger deletion is from CNVnator and the smaller deletion is from GenomeStrip.

Note, the tutorial's default parameter run gives male XY samples CN1 for the majority of chrX, including for PAR regions, where coverage is actually on par with the CN2 of female XX samples. The sensitive parameter run of Tutorial#11684 section 4.1 in turn correctly gives male XY samples CN2 for chrX PAR regions and CN1 for the majority of the remaining regions. The figure below illustrates with HG00096 (male) and NA19017 (female) samples.


Here are some example commands to convert gCNV VCF results to SEG format.

[i] VariantsToTable to subset and columnize annotations

gatk VariantsToTable \
-V genotyped-segments-case-twelve-vs-cohort23.vcf.gz \
-O genotyped-segments-case-twelve-vs-cohort23.table.txt

[ii] Unix shell commands to convert to SEG format data

This involves adding a first column with the sample name.

sampleName=$(gzcat genotyped-segments-case-twelve-vs-cohort23.vcf.gz | grep -v '##' | head -n1 | cut -f10)
awk -v sampleName=$sampleName 'BEGIN {FS=OFS="\t"} {print sampleName, $0}' genotyped-segments-case-twelve-vs-cohort23.table.txt > ${i}.seg; head ${i}.seg

Combine [i] and [ii] into a for-loop to process multiple gCNV VCFs into SEGs

for i in `ls genotyped-segments*.vcf.gz`; 
do sampleName=$(gzcat $i | grep -v '##' | head -n1 | cut -f10); 
gatk4100 VariantsToTable -V ${i} -F CHROM -F POS -F END -GF NP -GF CN -verbosity ERROR -O ${i}.table.txt; 
awk -v sampleName=$sampleName 'BEGIN {FS=OFS="\t"} {print sampleName, $0}' ${i}.table.txt > ${i}.seg; head ${i}.seg; 

2. Towards ascertaining the quality of gCNV calls, use Jupyter Notebooks

Towards profiling the quality of gCNV calls, for example, to be able to filter low-quality false positives, consider interacting with data via Python Jupyter Notebooks. To illustrate elements of gCNV results and to showcase the utility of Notebooks, the gCNV tutorial comes with two companion Notebook reports.

  • Notebook#11685 shows an approach to measuring concordance of a sample's gCNV calls to 1000 Genomes Project truth set calls.
  • Notebook#11686 examines gCNV callset annotations using larger data, namely chr20 gCNV results from the tutorial's 24-sample cohort.

We know from short variant discovery there is much more to quality calls than just calling with HaplotypeCaller. In short variant calling, we are stringent in preprocessing, e.g. by removing duplicate inserts from consideration, empower detection by analyzing a cohort, filter out unlikely calls based on numerous annotations and refine calls using pedigree and population allele frequencies.

Wrangling structural variants, which include CNVs, requires carving new paths in how we think about variation. For example, how do we go about similar quality control for variants that have imprecise starts and ends, add a dimension of copy number states and are based solely on a raw data feature that is inherently noisy and highly susceptible to batch effects? Coverage fluctuations blur the line between what is signal and what is noise.

Speaking of batch effects, RNA-Seq amplifies such effects. In differential expression analyses, researchers go so far as to attempt to correct for batch effects at the risk of losing true biological signal. For example, ComBat is a correction algorithm that considers covariates and provides parametric (assumes a prior probability distribution) and non-parametric (no prior assumptions) corrections for expression data. If we think about it, GermlineCNVCaller's model fitting similarly performs correction for DNA-Seq data, also by considering covariates and learned bias factors. To be clear, GATK gCNV is not suitable for expression data, as it makes integer ploidy assumptions and also assumes the majority of coverage across the genome is normal, i.e. diploid for mammals.

3. Consider ModelSegments CNV for rare gCNV detection

If research aims focus on rare gCNV events only, e.g. non-multiallelic CNV singletons for a sample compared to a cohort, it is worth looking into the GATK4 ModelSegments CNV workflow, which is sensitive to fractional changes and runs amazingly quickly. The ModelSegments CNV workflow is designed for somatic CNA detection and thus operates with different assumptions than the gCNV workflow. Below is one possible outcome from running ModelSegments CNV on Tutorial#11684's sample NA19017 against a PoN made from the remaining 23 samples.


4. Go back and call short variants for gCNV event regions

image First, my thanks to @bshifaw for graceously agreeing to prepare a short variant callset for Tutorial#11684's 24-sample cohort (using the workflow here).

If we examine regions the short variant discovery workflow filters alongside Tutorial#11684's gCNV events, we see an interesting correlation. For example, in the locus in the screenshot, it appears short variants (top track) coinciding with the copy number event (tracks 3-6) are all filtered. What could explain this coincidence?

Blatting the region of the CNV event shows many other loci with high sequence similarity and lower-case reference bases. This region's sequence is repeated in the genome. Repeated sequence regions yield low mapping qualities (MAPQ), which the gCNV workflow accounts for in two ways. First, coverage collection employs the MappingQualityReadFilter, which excludes from counts reads with MAPQ less than ten. Second, the gCNV workflow can incorporate read mappability data towards filtering intervals (Tutorial#11684 section 2).

Many other copy number event regions do not show such swaths of coincident filtered short variants. For these other regions, it may be desirable to call short variants with a corrected copy number, e.g. using HaplotypeCaller's -ploidy argument and the -L intervals argument. The GATK short variant discovery workflow is amenable to calling on different copy number regions as well as on mixed copy number regions. It is possible to plan ahead for minimal HaplotypeCaller GCVF mode runs as alluded to in this discussion.

5. Explore functional implications with GTEx


Towards dissecting functional implications, researchers may find consulting the Broad GTEx portal insightful. For example, check out the aforementioned SIRPB1 locus. The figure above shows tissue-specific exon expression for SIRPB1. The transcript isoforms are at the bottom. Gene directionality is right to left. Besides the dark coloring for the full-length transcript in whole blood tissue, it appears other tissues including brain express mostly the upstream exons.



SIRPB1 eQTL data corroborate such differential expression in the two organ types. The first figure above recapitulates the gCNV calls for the locus and expands the Gene track to show common isoforms. The second figure plots eQTLs across the gene, where:

  • In whole blood eQTLs present in the 3' two-thirds of the gene correlated with expression.
  • In brain tissues eQTLs present in the 5' third of the gene correlated with expression.

The gCNV workflow calls a copy number deletion for the SIRPB1 locus. The default parameters (first two SEG tracks) versus the sensitive parameters (third SEG track) give differently sized events. The gCNV tutorial cohort consists of 24 samples, which is roughly a quarter of the recommended minimum of 100 model samples. It may be that the sensitive WGS parameters help compensate for the lack of samples. It is most likely even with a hundred samples, workflow parameters need tuning for optimal results. This illustrates the importance of tuning parameters appropriately for data.

Scanning literature for SIRPB1 yields interesting biological correlations associated with the deletion, including implication in Alzheimer's disease (a; b; c; d).

Post edited by shlee on
Sign In or Register to comment.