
(Notebook) Concordance of NA19017 chr20 gCNV calls

shlee (Cambridge) · Member, Broadie
edited April 1 in Tutorials

This notebook shows how to perform the analysis alluded to in Tutorial#11684. Namely, it compares gCNV calls against 1000 Genomes Project truthset calls using the tutorial data. Tutorial#11684 illustrates with sample NA19017, as does this notebook. A straightforward exercise is to recapitulate the notebook analysis for results from different parameters, for differently sized data, or for another sample in the cohort.

The sections of Notebook#11685 are as follows.

  1. Visualize callset overlap with IGV

  2. Compare callset overlap with BedTools

  3. Explore the gCNV calls

Download an HTML version and the .ipynb format notebook here.


Install and import Python packages

# Prerequisites include Python 3, Jupyter, GATK4 and BedTools.
# Notebook results use v3.7.1, v4.4.0, v4.1.0.0 and v2.27.1, respectively, and assume a Unix environment.
# Installation requires root access and presumes igv is set up as detailed in https://github.com/igvteam/igv-jupyter/issues/23
! pip install numpy pandas matplotlib seaborn igv
import igv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

1. Visualize callset overlap with IGV

This exercise uses the igv.js-jupyter extension, which is interactive.

1.1 Prepare truthset calls for comparison

This preparation omits conversion of truthset calls to SEG format.

# Navigate to directory with the tutorial data and list vcf.gz format files.
%cd /Users/shlee/Documents/gatk4/gatk4_gcnv/tutorial
! ls *.vcf.gz
# Subset 24-sample variant callset to samples of interest and the regions the gCNV analysis covers.
# The input VCF was previously subset to the 24 samples, chr20 and variant CNV events.
! gatk4100 SelectVariants \
-V 1kgpSV-v2-20130502_cohort24-chr20varCNVs.vcf.gz \
--exclude-non-variants true \
--remove-unused-alternates true \
--keep-original-ac true \
--keep-original-dp true \
--sample-name NA19017 \
-L chr20sub_rmvChr.bed \
-O 1kgpSV_NA19017_chr20sub.vcf.gz

1.2 Visualize truthset sample calls alongside gCNV sample calls

# Invoke the IGV browser and set IGV's reference and genomic location(s) of interest.
# The first view is space to accommodate track labels, which otherwise overlap features.

b = igv.Browser(
    {
        "genome": "hg38",
        "locus": [  "chr20:1-32000",
                    "chr20:1,560,577-1,625,582",
                    "chr20:43,641,139-43,647,469",
                    "chr20:44,662,510-44,707,929",
                    "chr20:55,798,693-55,875,381",
                    "chr20:63,092,730-63,094,955"
        ]
    }
)
b.show()

# Load data tracks
b.load_track(
    {
        "name": "1kgp",
        "url": "1kgpSV_NA19017_chr20sub.vcf.gz",
        "format": "vcf",
        "type": "variant"
    }
)
b.load_track(
    {
        "name": "1kgp.seg",
        "displayMode": "EXPANDED",
        "url": "1kgpSV_NA19017_CN.seg",
        "format": "seg",
        "indexed": False
    }
)
b.load_track(
    {
        "name": "gCNV",
        "url": "genotyped-segments-cohort24-NA19017.vcf.gz",
        "format": "vcf",
        "type": "variant"
    }
)    
b.load_track(
    {
        "name": "gCNV.seg",
        "displayMode": "EXPANDED",
        "url": "NA19017_chr20var.seg",
        "format": "seg",
        "indexed": False
    }
)
b.get_svg()
b.display_svg()


2. Compare callset overlap with BedTools

2.1 Prepare gCNV calls for comparison

Tutorial#11684 illustrates conversion of the gCNV workflow VCF results to SEG format. Here we subset to chr20 data where CN is non-diploid and also convert to BED format.

# Navigate to directory with the tutorial data and list SEG format files.
%cd /Users/shlee/Documents/gatk4/gatk4_gcnv/tutorial
! ls *.seg
# Subset to chr20 data where CN does not equal 2
! cat genotyped-segments-cohort24-NA19017.vcf.gz.seg | awk '$2 == "CHROM" || $2 =="chr20"' | awk '$6 != 2' > NA19017_chr20var.seg
! head -n 3 NA19017_chr20var.seg
! tail -n 3 NA19017_chr20var.seg
! wc -l NA19017_chr20var.seg
NA19017 CHROM   POS END NA19017.NP  NA19017.CN
NA19017 chr20   133001  135000  2   1
NA19017 chr20   1606001 1609000 3   1
NA19017 chr20   64291001    64293000    2   1
NA19017 chr20   64295001    64296000    1   0
NA19017 chr20   64326001    64334000    8   3
      52 NA19017_chr20var.seg
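For readers more comfortable in pandas, the awk subsetting above can be sketched as follows. The SEG rows here are made up for illustration; the real input is the genotyped-segments SEG file shown above.

```python
import pandas as pd
from io import StringIO

# Hypothetical SEG-like rows standing in for the real genotyped-segments file.
text = """Sample\tCHROM\tPOS\tEND\tNP\tCN
NA19017\tchr19\t1000\t2000\t1\t2
NA19017\tchr20\t133001\t135000\t2\t1
NA19017\tchr20\t500001\t600000\t100\t2
NA19017\tchr20\t1606001\t1609000\t3\t1
"""
seg = pd.read_csv(StringIO(text), sep='\t')

# Keep chr20 rows with a non-diploid copy number, mirroring
# awk '$2 == "chr20"' piped to awk '$6 != 2'.
variant = seg[(seg['CHROM'] == 'chr20') & (seg['CN'] != 2)]
print(len(variant))  # 2 rows survive
```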
# Convert SEG format to BED format data
! cut -f2-4 NA19017_chr20var.seg > case.intervals
! awk '{OFS="\t"} NR>1 {print $1, $2-1, $3}' case.intervals > case.bed
! head -n 3 case.bed
! tail -n 3 case.bed
! wc -l case.bed
chr20   133000  135000
chr20   1606000 1609000
chr20   1914000 1915000
chr20   64291000    64293000
chr20   64295000    64296000
chr20   64326000    64334000
      51 case.bed
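The `$2-1` in the awk conversion above is the key detail: SEG coordinates are 1-based with inclusive starts, while BED is 0-based and half-open. A one-function sketch of the shift (the function name is ours, not from the tutorial):

```python
def seg_to_bed(chrom, start, end):
    """Convert a 1-based inclusive SEG interval to a 0-based half-open BED interval."""
    return (chrom, start - 1, end)

# The first record of NA19017_chr20var.seg maps to the first record of case.bed.
print(seg_to_bed('chr20', 133001, 135000))  # ('chr20', 133000, 135000)
```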

2.2 Prepare truthset calls for comparison

Use the single-sample VCF prepared in section 1.1.

# Examine the ALT alleles present
! gzcat 1kgpSV_NA19017_chr20sub.vcf.gz | grep -v '#' | cut -f5 | sort | uniq

# Examine the genotype designations
! gzcat 1kgpSV_NA19017_chr20sub.vcf.gz | grep -v '#' | cut -f10 | sort | uniq
<CN0>
0|1
1|0
1|1
# Convert VCF to table 
! gatk4100 VariantsToTable \
-V 1kgpSV_NA19017_chr20sub.vcf.gz \
-O 1kgpSV_NA19017_chr20sub.table \
-F CHROM -F POS -F END
# Convert table to BED format data 
! sed 's/^/chr/' 1kgpSV_NA19017_chr20sub.table | awk '{OFS="\t"} NR>1 {print $1, $2-1, $3}' > 1kgpSV_NA19017_chr20sub.bed
! head -n3 1kgpSV_NA19017_chr20sub.bed
! tail -n3 1kgpSV_NA19017_chr20sub.bed
! wc -l 1kgpSV_NA19017_chr20sub.bed
chr20   1408451 1410171
chr20   1580092 1613039
chr20   1609254 1613207
chr20   55859461    55865660
chr20   63093184    63094468
chr20   63093348    63094249
      21 1kgpSV_NA19017_chr20sub.bed

2.3 Use bedtools coverage to measure overlap between the two callsets

Here we examine overlap only and ignore whether copy-number states match.

  • The tutorial's smallest data, consisting of twelve regions, covers 1.4 Mb on chr20. Regions correspond to those in the truthset with events above 1000 bases.
  • The tutorial's chr20sub data covers a larger portion of chr20, roughly 15 Mb.
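For orientation, the last four columns bedtools coverage appends to each A interval are: the count of overlapping B features, the bases of A covered by B, the length of A, and the covered fraction. A minimal pure-Python sketch of that per-interval computation, using made-up intervals:

```python
def coverage(a, b_list):
    """Mimic the last four bedtools coverage columns for one A interval.

    a: (start, end); b_list: list of (start, end) features.
    Returns (n_features, covered_bases, a_length, fraction).
    Assumes B features do not overlap one another, as is true for CNV segments.
    """
    a_start, a_end = a
    n, covered = 0, 0
    for b_start, b_end in b_list:
        overlap = min(a_end, b_end) - max(a_start, b_start)
        if overlap > 0:
            n += 1
            covered += overlap
    length = a_end - a_start
    return n, covered, length, covered / length

# Hypothetical truth interval vs. two case segments
print(coverage((100, 300), [(50, 150), (200, 260)]))  # (2, 110, 200, 0.55)
```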
# <https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html> gives interpretation guidelines
! bedtools coverage -a 1kgpSV_NA19017_chr20sub.bed -b case.bed > 1kgpSV_NA19017_chr20sub_vs_case.coverage
aTruthBCase = pd.read_csv('1kgpSV_NA19017_chr20sub_vs_case.coverage', sep='\t', header=None)
aTruthBCase.columns = ['chrom', 'start', 'end', 'nFeatures-B', 'bases-B', 'bases-A', 'fractionBases-B/A']
aTruthBCase

# Sort the table by 'bases-A':
aTruthBCase.sort_values('bases-A')
chrom start end nFeatures-B bases-B bases-A fractionBases-B/A
8 chr20 23122679 23122981 0 0 302 0.000000
4 chr20 2379276 2379872 0 0 596 0.000000
5 chr20 3104161 3104765 0 0 604 0.000000
10 chr20 32739537 32740213 0 0 676 0.000000
9 chr20 24923059 24923778 0 0 719 0.000000
6 chr20 15730960 15731792 0 0 832 0.000000
20 chr20 63093348 63094249 0 0 901 0.000000
19 chr20 63093184 63094468 0 0 1284 0.000000
15 chr20 44667865 44669273 1 1000 1408 0.710227
0 chr20 1408451 1410171 0 0 1720 0.000000
17 chr20 52142653 52144379 1 1000 1726 0.579374
3 chr20 2014836 2016906 1 1000 2070 0.483092
16 chr20 44698668 44700883 3 2215 2215 1.000000
12 chr20 33698902 33701182 1 2000 2280 0.877193
7 chr20 21305449 21308134 1 2000 2685 0.744879
14 chr20 43643221 43645948 2 1779 2727 0.652365
13 chr20 34653921 34656669 1 2000 2748 0.727802
11 chr20 32937722 32941600 1 3000 3878 0.773595
2 chr20 1609254 1613207 0 0 3953 0.000000
18 chr20 55859461 55865660 2 2000 6199 0.322633
1 chr20 1580092 1613039 1 3000 32947 0.091055
# Perform the reciprocal comparison
! bedtools coverage -b 1kgpSV_NA19017_chr20sub.bed -a case.bed > 1kgpSV_NA19017_chr20sub_vs_case_reverse.coverage
bTruthACase = pd.read_csv('1kgpSV_NA19017_chr20sub_vs_case_reverse.coverage', sep='\t', header=None)
bTruthACase.columns = ['chrom', 'start', 'end', 'nFeatures-B', 'bases-B', 'bases-A', 'fractionBases-B/A']
bTruthACase
chrom start end nFeatures-B bases-B bases-A fractionBases-B/A
0 chr20 133000 135000 0 0 2000 0.000
1 chr20 1606000 1609000 1 3000 3000 1.000
2 chr20 1914000 1915000 0 0 1000 0.000
3 chr20 2015000 2016000 1 1000 1000 1.000
4 chr20 2824000 2825000 0 0 1000 0.000
5 chr20 11633000 11638000 0 0 5000 0.000
6 chr20 12802000 12807000 0 0 5000 0.000
7 chr20 18602000 18603000 0 0 1000 0.000
8 chr20 18951000 18954000 0 0 3000 0.000
9 chr20 21306000 21308000 1 2000 2000 1.000
10 chr20 21911000 21912000 0 0 1000 0.000
11 chr20 23426000 23428000 0 0 2000 0.000
12 chr20 23428000 23429000 0 0 1000 0.000
13 chr20 23430000 23431000 0 0 1000 0.000
14 chr20 24408000 24409000 0 0 1000 0.000
15 chr20 32723000 32724000 0 0 1000 0.000
16 chr20 32938000 32941000 1 3000 3000 1.000
17 chr20 33448000 33456000 0 0 8000 0.000
18 chr20 33699000 33701000 1 2000 2000 1.000
19 chr20 34230000 34231000 0 0 1000 0.000
20 chr20 34654000 34656000 1 2000 2000 1.000
21 chr20 42615000 42619000 0 0 4000 0.000
22 chr20 43643000 43644000 1 779 1000 0.779
23 chr20 43644000 43645000 1 1000 1000 1.000
24 chr20 44668000 44669000 1 1000 1000 1.000
25 chr20 44698000 44699000 1 332 1000 0.332
26 chr20 44699000 44700000 1 1000 1000 1.000
27 chr20 44700000 44701000 1 883 1000 0.883
28 chr20 47828000 47829000 0 0 1000 0.000
29 chr20 47829000 47830000 0 0 1000 0.000
30 chr20 47830000 47834000 0 0 4000 0.000
31 chr20 47834000 47835000 0 0 1000 0.000
32 chr20 47900000 47901000 0 0 1000 0.000
33 chr20 47901000 47902000 0 0 1000 0.000
34 chr20 48504000 48509000 0 0 5000 0.000
35 chr20 48511000 48516000 0 0 5000 0.000
36 chr20 52143000 52144000 1 1000 1000 1.000
37 chr20 55808000 55809000 0 0 1000 0.000
38 chr20 55863000 55864000 1 1000 1000 1.000
39 chr20 55864000 55865000 1 1000 1000 1.000
40 chr20 61943000 61944000 0 0 1000 0.000
41 chr20 61944000 61945000 0 0 1000 0.000
42 chr20 61946000 61947000 0 0 1000 0.000
43 chr20 63965000 63966000 0 0 1000 0.000
44 chr20 64094000 64095000 0 0 1000 0.000
45 chr20 64132000 64134000 0 0 2000 0.000
46 chr20 64174000 64175000 0 0 1000 0.000
47 chr20 64175000 64176000 0 0 1000 0.000
48 chr20 64291000 64293000 0 0 2000 0.000
49 chr20 64295000 64296000 0 0 1000 0.000
50 chr20 64326000 64334000 0 0 8000 0.000

The first aTruthBCase bedtools coverage result shows that gCNV misses some truthset events, and these tend to be the smaller ones. The resolution of events gCNV can detect depends in part on the coverage bin size. The tutorial data bin size is 1000 bases.

2.4 Create a pandas summary metrics table

The metrics are from the perspective of the truthset. Previously, we imported bedtools coverage results into a pandas dataframe called aTruthBCase.

# Create metrics to summarize the bedtools coverage results.
summaryMetrics = [aTruthBCase['bases-A'].sum(), aTruthBCase['bases-B'].sum()]
summaryMetricsDb = pd.DataFrame(summaryMetrics).transpose()
summaryMetricsDb.columns = ['total-A', 'total-B']
summaryMetricsDb['total-B/A'] = summaryMetricsDb['total-B']/summaryMetricsDb['total-A']
summaryMetricsDb['n-events-sum'] = aTruthBCase['nFeatures-B'].sum()
summaryMetricsDb['n-events-min'] = aTruthBCase['nFeatures-B'].min()
summaryMetricsDb['n-events-max'] = aTruthBCase['nFeatures-B'].max()
summaryMetricsDb['n-events-at-0'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 0].shape[0]
summaryMetricsDb['n-events-at-1'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 1].shape[0]
summaryMetricsDb['n-events-at-2'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 2].shape[0]
summaryMetricsDb['n-events-at-3'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 3].shape[0]

summaryMetricsDb['fraction-events-uncalled'] = summaryMetricsDb['n-events-at-0'] / len(aTruthBCase.index)
summaryMetricsDb['fraction-events-called'] = 1 - summaryMetricsDb['fraction-events-uncalled']

intermed1 = aTruthBCase.loc[aTruthBCase['nFeatures-B'] > 0, ['bases-B', 'bases-A']]
intermed2 = [intermed1['bases-A'].sum(), intermed1['bases-B'].sum()]
intermed3 = intermed2[1]/intermed2[0]
summaryMetricsDb['non-zero-events-B/A'] = intermed3

# len() of a boolean Series is just the total row count, so use the index length directly.
summaryMetricsDb['n-rows'] = len(aTruthBCase.index)
summaryMetricsDb['n-min-50%-overlap'] = len(aTruthBCase.loc[aTruthBCase['fractionBases-B/A'] >= 0.5])
summaryMetricsDb['n-rows-non-0'] = summaryMetricsDb['n-rows'] - summaryMetricsDb['n-events-at-0']
summaryMetricsDb['fraction-50%-overlap-of-non-0'] = summaryMetricsDb['n-min-50%-overlap'] / summaryMetricsDb['n-rows-non-0']

summaryMetricsDb

Here, we arbitrarily define an overlap minimum of 50% of the truthset call as something of interest. The stringency for these types of concordance calculations can be much more refined, e.g. requiring 50% reciprocal overlap.
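As a sketch of what the stricter criterion means: 50% reciprocal overlap requires the shared bases to cover at least half of the truth call and at least half of the case call. The function and intervals below are illustrative, not from the tutorial data.

```python
def reciprocal_overlap(a, b, min_fraction=0.5):
    """True if intervals a and b share at least min_fraction of BOTH lengths."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return False
    return (shared / (a[1] - a[0]) >= min_fraction and
            shared / (b[1] - b[0]) >= min_fraction)

print(reciprocal_overlap((0, 1000), (500, 1500)))  # True: 50% of both calls
print(reciprocal_overlap((0, 1000), (500, 2000)))  # False: only a third of the second call
```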

2.5 Subset the events with 50% or more truthset overlap

concordant = aTruthBCase.loc[aTruthBCase['fractionBases-B/A'] >= 0.5, ['chrom', 'start', 'end', 'bases-B', 'bases-A', 'fractionBases-B/A']]
concordant.sort_values('fractionBases-B/A')
chrom start end bases-B bases-A fractionBases-B/A
17 chr20 52142653 52144379 1000 1726 0.579374
14 chr20 43643221 43645948 1779 2727 0.652365
15 chr20 44667865 44669273 1000 1408 0.710227
13 chr20 34653921 34656669 2000 2748 0.727802
7 chr20 21305449 21308134 2000 2685 0.744879
11 chr20 32937722 32941600 3000 3878 0.773595
12 chr20 33698902 33701182 2000 2280 0.877193
16 chr20 44698668 44700883 2215 2215 1.000000
# Convert concordant regions to list of intervals
concordant.to_csv('concordant.bed', columns=["chrom", "start", "end"], sep='\t', index=False, header=False)
! head concordant.bed
# Subset gCNV calls to those concordant with the truthset
! gatk4100 SelectVariants \
-V genotyped-segments-cohort24-NA19017.vcf.gz \
-L concordant.bed \
-O segments-NA19017-concordant.vcf.gz
# Check the number of records before and after subsetting
! gzcat genotyped-segments-cohort24-NA19017.vcf.gz | grep -v '#' | wc -l
! gzcat segments-NA19017-concordant.vcf.gz | grep -v '#' | wc -l
      91
      23

3. Explore the gCNV calls

3.1 Plot histograms of event sizes and quality annotations

# Get description of gCNV workflow variant annotations.
# Those that start with 'Q' denote quality annotations.
! gzcat genotyped-segments-cohort24-NA19017.vcf.gz | grep '##FORMAT=<ID='
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Segment most-likely copy-number call">
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Segment genotype">
##FORMAT=<ID=NP,Number=1,Type=Integer,Description="Number of points (i.e. targets or bins) in the segment">
##FORMAT=<ID=QA,Number=1,Type=Integer,Description="Complementary Phred-scaled probability that all points (i.e. targets or bins) in the segment agree with the segment copy-number call">
##FORMAT=<ID=QS,Number=1,Type=Integer,Description="Complementary Phred-scaled probability that at least one point (i.e. target or bin) in the segment agrees with the segment copy-number call">
##FORMAT=<ID=QSE,Number=1,Type=Integer,Description="Complementary Phred-scaled probability that the segment end position is a genuine copy-number changepoint">
##FORMAT=<ID=QSS,Number=1,Type=Integer,Description="Complementary Phred-scaled probability that the segment start position is a genuine copy-number changepoint">
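The Q annotations above are Phred-scaled: a score Q encodes a probability of 10^(-Q/10), so Q = 30 corresponds to 0.001. A quick conversion helper (the function name is ours):

```python
def phred_to_prob(q):
    """Convert a Phred-scaled score Q to the probability it encodes: 10^(-Q/10)."""
    return 10 ** (-q / 10)

for q in (10, 30, 100):
    print(q, phred_to_prob(q))  # 0.1, 0.001, 1e-10
```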
# Prepare the callsets for use with pandas
! gatk4100 VariantsToTable \
-V genotyped-segments-cohort24-NA19017.vcf.gz \
-O segments-NA19017.table \
-F CHROM -F POS -F END -GF CN -GF NP -GF QA -GF QS -GF QSE -GF QSS

! gatk4100 VariantsToTable \
-V segments-NA19017-concordant.vcf.gz \
-O segments-NA19017-concordant.table \
-F CHROM -F POS -F END -GF CN -GF NP -GF QA -GF QS -GF QSE -GF QSS
# Create pandas dataframe from table and subset to non-diploid segments for ALL calls
gcnv = pd.read_csv('segments-NA19017.table', sep='\t')
gcnvVar = pd.DataFrame(gcnv.loc[gcnv['NA19017.CN'] != 2])
gcnvVar
# Create pandas dataframe from table and subset to non-diploid segments for CONCORDANT calls
gcnvConcordant = pd.read_csv('segments-NA19017-concordant.table', sep='\t')
gcnvConcordantVar = pd.DataFrame(gcnvConcordant.loc[gcnvConcordant['NA19017.CN'] != 2])
gcnvConcordantVar
CHROM POS END NA19017.CN NA19017.NP NA19017.QA NA19017.QS NA19017.QSE NA19017.QSS
1 chr20 21306001 21308000 0 2 286 497 322 286
4 chr20 32938001 32941000 1 3 114 278 30 13
7 chr20 33699001 33701000 1 2 45 107 76 45
10 chr20 34654001 34656000 1 2 81 109 57 92
12 chr20 43643001 43644000 1 1 30 30 30 30
13 chr20 43644001 43645000 0 1 209 209 209 209
15 chr20 44668001 44669000 1 1 29 29 29 29
17 chr20 44698001 44699000 1 1 40 40 40 40
18 chr20 44699001 44700000 0 1 309 309 309 309
19 chr20 44700001 44701000 1 1 147 147 147 147
21 chr20 52143001 52144000 1 1 77 77 77 77

Note the concordant calls are all CN0 or CN1 events. The truthset provides only deletions!

# Plot the size distribution as a histogram. Here, utilize the fact that NP represents the number of bins.
# An extension of this analysis would be to draw histograms of duplications versus deletions.
binSize = 1000
events = gcnvVar['NA19017.NP']*binSize
sns.distplot(events, kde=False, bins=np.linspace(1000,10000,10))

# Overlay event sizes of concordant calls
eventsConcordant = gcnvConcordantVar['NA19017.NP']*binSize
sns.distplot(eventsConcordant, kde=False, bins=np.linspace(1000,10000,10))

plt.suptitle('Histogram of gCNV event sizes', fontsize=24)

# Similarly plot the quality scores QA and QS.
fig, axs = plt.subplots(1,2, figsize=(18,6))
qa = gcnvVar['NA19017.QA']
qs = gcnvVar['NA19017.QS']
sns.distplot(qa, ax=axs[0], bins=np.linspace(0,600,20), rug=True)
sns.distplot(qs, ax=axs[1], bins=np.linspace(0,600,20), rug=True)
axs[0].set_xlabel('QA', fontsize = 16)
axs[1].set_xlabel('QS', fontsize = 16)
plt.suptitle('Histograms of gCNV quality scores (all vs. concordant)', fontsize = 24)

axs[0].set_xlim(0,600)
axs[1].set_xlim(0,600)

# Overlay QA and QS histograms of concordant calls
qaConcordant = gcnvConcordantVar['NA19017.QA']
qsConcordant = gcnvConcordantVar['NA19017.QS']
sns.distplot(qaConcordant, ax=axs[0], bins=np.linspace(0,600,20), rug=True)
sns.distplot(qsConcordant, ax=axs[1], bins=np.linspace(0,600,20), rug=True)

There is an ever-so-slight rightward shift in qualities for the concordant calls.

Could there be a correlation between event size and quality score?
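One quick way to check is the Pearson correlation between NP-derived event size and QS. The sketch below re-creates a small dataframe from the concordant-call values printed above rather than reloading the table:

```python
import pandas as pd

# NP and QS values copied from the concordant-call table above.
df = pd.DataFrame({
    'NA19017.NP': [2, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'NA19017.QS': [497, 278, 107, 109, 30, 209, 29, 40, 309, 147, 77],
})
binSize = 1000
df['eventSize'] = df['NA19017.NP'] * binSize

# Pearson correlation between event size and quality score (Series.corr defaults to Pearson).
r = df['eventSize'].corr(df['NA19017.QS'])
print(round(r, 2))  # a modest positive correlation for this small subset
```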

3.2 Summarize metrics on gCNV calls by event type

# Collect some metrics
binSize = 1000
deletions = gcnvVar.loc[gcnvVar['NA19017.CN'] < 2]
insertions = gcnvVar.loc[gcnvVar['NA19017.CN'] > 2]

gcnvVarSummary = pd.DataFrame([np.count_nonzero(gcnvVar['NA19017.CN'] == 0)], columns=['cn0'])
gcnvVarSummary['cn1'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 1)
gcnvVarSummary['cn3'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 3)
gcnvVarSummary['cn4'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 4)
gcnvVarSummary['cn5'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 5)
gcnvVarSummary['n-total'] = len(gcnvVar.index)
gcnvVarSummary['n-dels'] = gcnvVarSummary['cn0'] + gcnvVarSummary['cn1']
gcnvVarSummary['n-amps'] = gcnvVarSummary['cn3'] + gcnvVarSummary['cn4'] + gcnvVarSummary['cn5']
gcnvVarSummary['bases-dels'] = deletions['NA19017.NP'].sum()*binSize
gcnvVarSummary['bases-amps'] = insertions['NA19017.NP'].sum()*binSize
gcnvVarSummary['avg-del-size'] = round(gcnvVarSummary['bases-dels'] / gcnvVarSummary['n-dels']).astype(int)
gcnvVarSummary['avg-amp-size'] = round(gcnvVarSummary['bases-amps'] / gcnvVarSummary['n-amps']).astype(int)
gcnvVarSummary['at1kb'] = len(gcnvVar[gcnvVar['NA19017.NP'] == 1])
gcnvVarSummary['fractionAt1kb'] = gcnvVarSummary['at1kb']/len(gcnvVar[gcnvVar['NA19017.NP'] >= 1])

gcnvVarSummary

Over half of the events are exactly at 1kbp, the coverage bin size. Most of the events are deletions. The next notebook analysis, Notebook#11686, examines gCNV callset annotations using bigger data.

