
(howto) Create custom datasources for Oncotator

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)
edited June 2014 in Oncotator documentation


Supported input formats

1. TSV

Indexed by transcript ID (transcript_tsv), gene (gene_tsv), genomic position (gp_tsv), or gene and amino acid position (gpp_tsv). You can also index by genomic position and, optionally, reference and alternate alleles, using indexed_tsv.

Use indexed_tsv when the contents of the tsv file are too large to fit in RAM.

2. VCF

Indexed by position and, optionally, reference and alternate alleles. The RAM footprint is small.

Starting from a TSV (tab-separated values) file

In this type of file, the annotations are based on gene, genomic position, gene-amino acid position, or transcript ID. See this doc for details on the format requirements.

Let's start with this example of a minimal TSV input file indexed by transcript ID for RefSeq:

gaf_transcript_id mRNA_Id prot_Id
uc001hms.3 NM_022746 NP_073583
uc001hmt.3 NM_022746 NP_073583

Here, the index column would be gaf_transcript_id and the ds_type would be transcript_tsv.
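As a minimal sketch, the example above can be written out as a real tab-separated file and checked for the index column before a datasource is built from it. The file name refseq_ids.tsv and the helper has_column are hypothetical names chosen for this illustration; they are not part of Oncotator.

```shell
# Write the three-column example as a real tab-separated file.
# refseq_ids.tsv is a hypothetical file name used only in this sketch.
printf 'gaf_transcript_id\tmRNA_Id\tprot_Id\n' > refseq_ids.tsv
printf 'uc001hms.3\tNM_022746\tNP_073583\n' >> refseq_ids.tsv
printf 'uc001hmt.3\tNM_022746\tNP_073583\n' >> refseq_ids.tsv

# Hypothetical helper: succeeds if the header row contains the named column.
has_column() {
    # $1 = TSV file, $2 = exact column name to look for
    head -n 1 "$1" | tr '\t' '\n' | grep -Fxq "$2"
}

has_column refseq_ids.tsv gaf_transcript_id && echo "index column found"
```

Checking the header first catches a misspelled index column name before any datasource files are generated.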

Creating the datasource

You will use the utility named initializeDatasource, which is installed along with Oncotator (see initializeDatasource --help for more extensive usage instructions). The command structure is the following:

$ initializeDatasource --ds_type {gp_tsv,gene_tsv,transcript_tsv} --ds_file ds_file --name name --version version --dbDir dbDir --ds_foldername ds_foldername --genome_build {hg19} --index_columns index_columns

where the { } characters indicate enumerated options.


Up-to-date and additional examples can be found by executing initializeDatasource --help
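To see how the pieces of the command fit together, here is a hedged sketch of a dry-run helper that only prints the command line rather than executing it. The function name build_ds_cmd and the sample values (refseq_ids.tsv, RefSeq_IDs, 1.0, refseq_ids) are hypothetical. The flags are limited to those that appear in this document, with --dbDir spelled as in the worked examples; confirm them against initializeDatasource --help for your installed version.

```shell
# Dry-run helper (hypothetical): prints an initializeDatasource command
# instead of running it, so the arguments can be reviewed first.
build_ds_cmd() {
    # $1 ds_type   $2 ds_file   $3 name   $4 version
    # $5 dbDir (parent directory)   $6 ds_foldername   $7 index_columns
    printf 'initializeDatasource --ds_type %s --ds_file %s --name "%s" --version "%s" --dbDir %s --ds_foldername %s --genome_build hg19 --index_columns %s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Example values for the transcript-indexed TSV; all of them are placeholders.
build_ds_cmd transcript_tsv refseq_ids.tsv RefSeq_IDs 1.0 "$HOME/oncotest" refseq_ids gaf_transcript_id
```

Once the printed command looks right, it can be copied to the prompt and run as-is.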

1. Create a datasource for ORegAnno (a generic genome position tsv)

Based on the command structure shown above, just fill in the appropriate names and parameters:

$ initializeDatasource --ds_type gp_tsv --ds_file oreganno_trim.hg19.txt --name ORegAnno --version "UCSC Track" --dbDir ~/oncotest --ds_foldername oreganno --genome_build hg19 --index_columns hg19.oreganno.chrom,hg19.oreganno.chromStart,hg19.oreganno.chromEnd

This will produce the appropriate database building files in the ~/oncotest/oreganno directory. It's important to note that the parent directory ~/oncotest and the created datasource directory oreganno are two separate parameters.
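The relationship between the two parameters can be sketched as a simple path join: the datasource folder is expected inside the parent directory. The helper name ds_path is hypothetical, and the check below only inspects the file system; it does not call Oncotator.

```shell
# Hypothetical helper: joins the parent directory (--dbDir) and the
# datasource folder name (--ds_foldername) into the expected output path.
ds_path() {
    # $1 = parent directory, $2 = datasource folder name
    # ${1%/} drops one trailing slash so the result has a single separator
    printf '%s/%s\n' "${1%/}" "$2"
}

target=$(ds_path "$HOME/oncotest" oreganno)

# Report whether the datasource directory exists yet.
if [ -d "$target" ]; then
    ls "$target"
else
    echo "not created yet: $target"
fi
```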

2. Create a MutSig Published Results datasource (a gene tsv)

Again, the parent directory ~/oncotest and the created datasource directory mutsig are two separate parameters.

$ initializeDatasource --ds_type gene_tsv --ds_file mutsig_results.import.20110905.txt --name "MutSig_Published_Results" --version "20110905" --dbDir ~/oncotest --ds_foldername mutsig --genome_build hg19 --index_columns gene

3. Create a datasource using Exome Sequencing Project (ESP) data that is in variant call format (VCF)

An example using a VCF:

$ initializeDatasource --ds_type indexed_vcf --ds_file ESP6500SI-V2.vcf --name ESP --version 6500SI-V2 --dbDir ~/oncotest_ESP6500SI-V2 --genome_build hg19 --match_mode exact --ds_foldername ~/ESP6500SI-V2_exact
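Before building a VCF-based datasource, it can help to confirm that the input really is a plain-text VCF. The sketch below relies only on the VCF convention that the first line is a ##fileformat=VCF declaration. The helper is_vcf and the stand-in file sample_check.vcf are hypothetical and hold header lines only; the check does not handle compressed files.

```shell
# Hypothetical helper: succeeds when the first line of the file is a
# VCF fileformat declaration (plain-text files only).
is_vcf() {
    # $1 = path to the file to check
    head -n 1 "$1" | grep -q '^##fileformat=VCF'
}

# Header-only stand-in file for demonstration; point is_vcf at the real
# ESP VCF instead when using this in practice.
printf '##fileformat=VCFv4.1\n' > sample_check.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> sample_check.vcf

is_vcf sample_check.vcf && echo "looks like a VCF"
```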

