(howto) Create custom datasources for Oncotator

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
edited June 2014 in Oncotator documentation


Supported input formats

1. TSV

Indexed by transcript ID (transcript_tsv), gene (gene_tsv), genomic position (gp_tsv), or gene and amino acid position (gpp_tsv). You can also index by genomic position and, optionally, reference and alternate alleles, using indexed_tsv.

Use indexed_tsv when the contents of the tsv file are too large to fit in RAM.

2. VCF

Indexed by position and, optionally, reference and alternate alleles. The RAM footprint is small.

Starting from a TSV (tab-separated values) file

In this type of file, the annotations are based on gene, genomic position, gene-amino acid position, or transcript ID. See this doc for details on the format requirements.

Let's start with this minimal example of a TSV input file indexed by transcript ID for RefSeq:

gaf_transcript_id mRNA_Id prot_Id
uc001hms.3 NM_022746 NP_073583
uc001hmt.3 NM_022746 NP_073583

Here, the index column would be gaf_transcript_id and the ds_type would be transcript_tsv.
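Before building a datasource from a file like this, it can help to confirm that the file really is tab-separated and that the intended index column appears in the header. The sketch below recreates the example under a hypothetical file name, refseq_example.tsv, in a temporary directory and runs two simple checks. The file name and the checks are illustrative assumptions, not part of Oncotator.

```shell
# Recreate the example as a tab-separated file.
# The file name is hypothetical, chosen only for this illustration.
work_dir=$(mktemp -d)
ds_file="$work_dir/refseq_example.tsv"
printf 'gaf_transcript_id\tmRNA_Id\tprot_Id\n' >  "$ds_file"
printf 'uc001hms.3\tNM_022746\tNP_073583\n'    >> "$ds_file"
printf 'uc001hmt.3\tNM_022746\tNP_073583\n'    >> "$ds_file"

# Check that the intended index column is one of the header fields.
index_column="gaf_transcript_id"
if head -n 1 "$ds_file" | tr '\t' '\n' | grep -qxF "$index_column"; then
    index_found="yes"
else
    index_found="no"
fi

# Check that every row has the same number of tab-separated fields as the header.
consistent=$(awk -F '\t' '
    NR == 1 { n = NF }
    NF != n { bad = 1 }
    END { if (bad) print "no"; else print "yes" }
' "$ds_file")

echo "index column present: $index_found"
echo "consistent field count: $consistent"
```

If either check reports "no", the file most likely contains spaces instead of tabs, or the header name does not match the value you plan to pass as the index column.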

Creating the datasource

You will use the utility named initializeDatasource, which is installed along with Oncotator (see initializeDatasource --help for more extensive usage instructions). The command structure is as follows:

$ initializeDatasource --ds_type {gp_tsv,gene_tsv,transcript_tsv} --ds_file ds_file --name name --version version --dbDir dbDir --ds_foldername ds_foldername --genome_build {hg19} --index_columns index_columns

where the { } characters indicate enumerated options.


Up-to-date and additional examples can be found by executing initializeDatasource --help
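As a sketch, here is how the command could be filled in for the transcript-indexed example file shown earlier. The file name, datasource name, version string, and folder name are hypothetical placeholders; only the option names are taken from the commands in this document. The block assembles the command and prints it as a dry run, so it can be inspected without Oncotator installed.

```shell
# Hypothetical values for illustration; substitute your own.
ds_file="refseq_example.tsv"      # the transcript-indexed TSV file
db_dir="$HOME/oncotest"           # parent directory for datasources
ds_foldername="refseq_example"    # folder to be created inside db_dir

# Assemble the command as positional parameters so each argument stays intact.
set -- initializeDatasource \
    --ds_type transcript_tsv \
    --ds_file "$ds_file" \
    --name RefSeq_example \
    --version example_v1 \
    --dbDir "$db_dir" \
    --ds_foldername "$ds_foldername" \
    --genome_build hg19 \
    --index_columns gaf_transcript_id

# Dry run: print the command instead of executing it.
dry_run_cmd="$*"
echo "$dry_run_cmd"

# To build the datasource for real, run the assembled command: "$@"
```

Printing first is a cheap way to catch a misspelled option or a wrong path before any files are written.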

1. Create a datasource for ORegAnno (a generic genome position tsv)

Based on the command structure shown above, just fill in the appropriate names and parameters:

$ initializeDatasource --ds_type gp_tsv --ds_file oreganno_trim.hg19.txt --name ORegAnno --version "UCSC Track" --dbDir ~/oncotest --ds_foldername oreganno --genome_build hg19 --index_columns hg19.oreganno.chrom,hg19.oreganno.chromStart,hg19.oreganno.chromEnd

This will produce the appropriate database building files in the ~/oncotest/oreganno directory. It's important to note that the parent directory ~/oncotest and the created datasource directory oreganno are two separate parameters.
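To make the relationship between the two parameters concrete, the sketch below joins them as described in the paragraph above and reports whether the resulting directory exists. The variable names are hypothetical, and the check only tests for the directory itself, not for its contents.

```shell
db_dir="$HOME/oncotest"     # value passed to --dbDir (the parent directory)
ds_foldername="oreganno"    # value passed to --ds_foldername

# The datasource files end up in the folder nested under the parent directory.
ds_path="$db_dir/$ds_foldername"

if [ -d "$ds_path" ]; then
    ds_status="found"
else
    ds_status="missing"
fi
echo "datasource directory $ds_path: $ds_status"
```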

2. Create a MutSig Published Results datasource (a gene tsv)

Again, the parent directory ~/oncotest and the created datasource directory mutsig are two separate parameters.

$ initializeDatasource --ds_type gene_tsv --ds_file mutsig_results.import.20110905.txt --name "MutSig_Published_Results" --version "20110905" --dbDir ~/oncotest --ds_foldername mutsig --genome_build hg19 --index_columns gene

3. Create a datasource from Exome Sequencing Project (ESP) data in Variant Call Format (VCF)

An example using a VCF:

$ initializeDatasource --ds_type indexed_vcf --ds_file ESP6500SI-V2.vcf --name ESP --version 6500SI-V2 --dbDir ~/oncotest_ESP6500SI-V2 --genome_build hg19 --match_mode exact --ds_foldername ~/ESP6500SI-V2_exact
Post edited by LeeTL1220