Funcotator user-defined data sources

justsangsue (New York, Member)
Hi, I have recently been trying to expand the annotation data sources used by Funcotator, but the documentation doesn't give much information or any examples. I am currently trying to add CADD to the data sources folder. After running Funcotator, I got this error:

org.broadinstitute.hellbender.exceptions.GATKException: Error initializing feature reader for path file: funcotator_dataSources.v1.6.20190124s/cadd/hg19/cadd.config
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:353)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:305)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:256)
at org.broadinstitute.hellbender.engine.FeatureManager.addToFeatureSources(FeatureManager.java:234)
at org.broadinstitute.hellbender.engine.GATKTool.addFeatureInputsAfterInitialization(GATKTool.java:957)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.DataSourceUtils.createAndRegisterFeatureInputs(DataSourceUtils.java:328)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.DataSourceUtils.createDataSourceFuncotationFactoriesForDataSources(DataSourceUtils.java:277)
at org.broadinstitute.hellbender.tools.funcotator.Funcotator.onTraversalStart(Funcotator.java:774)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1037)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
Caused by: htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Duplicate key 0, for input source: cadd.config
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:263)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:102)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:127)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:120)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:350)
... 14 more
Caused by: java.lang.IllegalStateException: Duplicate key 0
at java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
at java.util.HashMap.merge(HashMap.java:1254)
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.utils.codecs.xsvLocatableTable.XsvLocatableTableCodec.readActualHeader(XsvLocatableTableCodec.java:341)
at org.broadinstitute.hellbender.utils.codecs.xsvLocatableTable.XsvLocatableTableCodec.readActualHeader(XsvLocatableTableCodec.java:64)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:261)
... 18 more
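If I read the trace right, the "Duplicate key 0" comes out of a Java Collectors.toMap call while the XsvLocatableTableCodec reads the header, which I take to mean some key was seen twice while the columns were being indexed (though I'm not sure whether the offending key comes from cadd.config or from the TSV header itself). For what it's worth, here is a quick check for repeated column names, assuming the header is the first line of the file and is tab-delimited:

# Print any column names that occur more than once in the header row.
head -1 hg19/InDels_inclAnno.tsv | tr '\t' '\n' | sort | uniq -d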

java version:
java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1~deb9u1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

I added the cadd folder to the data sources folder, following the structure described in the documentation:

cadd
|- hg19
| |- cadd.config
| |- InDels_inclAnno.tsv
| |- InDels_inclAnno.tsv.gz.tbi
|
|- hg38
| |- cadd.config
| |- InDels_inclAnno.tsv
| |- InDels_inclAnno.tsv.gz.tbi
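For reference, the .tbi index was built with htslib's bgzip and tabix, roughly as follows (a sketch from memory, so the exact flags are my best recollection; tabix only accepts bgzip-compressed input, which is why the index is named InDels_inclAnno.tsv.gz.tbi):

# Compress with bgzip (tabix cannot index plain-text or gzip files),
# then index on contig (column 1) and position (column 2).
bgzip InDels_inclAnno.tsv
tabix -s 1 -b 2 -e 2 InDels_inclAnno.tsv.gz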

The config file (cadd.config):

name = CADD
version = v1.4
src_file = InDels_inclAnno.tsv
origin_location =
preprocessing_script = UNKNOWN

# Whether this data source is for the b37 reference.
# Required and defaults to false.
isB37DataSource = false

# Supported types:
# simpleXSV -- Arbitrary separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
# locatableXSV -- Arbitrary separated value table (e.g. CSV), keyed off a genome location
# gencode -- Custom datasource class for GENCODE
# cosmic -- Custom datasource class for COSMIC
# vcf -- Custom datasource class for Variant Call Format (VCF) files
type = locatableXSV

# Required field for GENCODE files.
# Path to the FASTA file from which to load the sequences for GENCODE transcripts:
gencode_fasta_path =

# Required field for GENCODE files.
# NCBI build version (either hg19 or hg38):
ncbi_build_version =

# Required field for simpleXSV files.
# Valid values:
# GENE_NAME
# TRANSCRIPT_ID
xsv_key = GENE_NAME

# Required field for simpleXSV files.
# The 0-based index of the column containing the key on which to match
xsv_key_column =

# Required field for simpleXSV AND locatableXSV files.
# The delimiter by which to split the XSV file into columns.
xsv_delimiter = \t

# Required field for simpleXSV files.
# Whether to permissively match the number of columns in the header and data rows
# Valid values:
# true
# false
xsv_permissive_cols =

# Required field for locatableXSV files.
# The 0-based index of the column containing the contig for each row
contig_column = 0

# Required field for locatableXSV files.
# The 0-based index of the column containing the start position for each row
start_column = 1

# Required field for locatableXSV files.
# The 0-based index of the column containing the end position for each row
end_column = 1

A snapshot of InDels_inclAnno.tsv:
Chrom Pos Ref Alt Type Length AnnoType Consequence ConsScore ConsDetail GC CpG motifECount
motifEName motifEHIPos motifEScoreChng oAA nAA GeneID FeatureID GeneName CCDS Intron Exon cDNApos
relcDNApos CDSpos relCDSpos protPos relProtPos Domain Dst2Splice Dst2SplType minDistTSS minDistTSE
SIFTcat SIFTval PolyPhenCat PolyPhenVal priPhCons mamPhCons verPhCons priPhyloP mamPhyloP verPhyloP
bStatistic targetScan mirSVR-Score mirSVR-E mirSVR-Aln cHmm_E1 cHmm_E2 cHmm_E3 cHmm_E4 cHmm_E5
cHmm_E6 cHmm_E7 cHmm_E8 cHmm_E9 cHmm_E10 cHmm_E11 cHmm_E12 cHmm_E13 cHmm_E14 cHmm_E15 cHmm_E16
cHmm_E17 cHmm_E18 cHmm_E19 cHmm_E20 cHmm_E21 cHmm_E22 cHmm_E23 cHmm_E24 cHmm_E25 GerpRS GerpRSpval
GerpN GerpS tOverlapMotifs motifDist EncodeH3K4me1-sum EncodeH3K4me1-max EncodeH3K4me2-sum
EncodeH3K4me2-max EncodeH3K4me3-sum EncodeH3K4me3-max EncodeH3K9ac-sum EncodeH3K9ac-max
EncodeH3K9me3-sum EncodeH3K9me3-max EncodeH3K27ac-sum EncodeH3K27ac-max EncodeH3K27me3-sum
EncodeH3K27me3-max EncodeH3K36me3-sum EncodeH3K36me3-max EncodeH3K79me2-sum EncodeH3K79me2-max
EncodeH4K20me1-sum EncodeH4K20me1-max EncodeH2AFZ-sum EncodeH2AFZ-max EncodeDNase-sum
EncodeDNase-max EncodetotalRNA-sum EncodetotalRNA-max Grantham Dist2Mutation Freq100bp Rare100bp
Sngl100bp Freq1000bp Rare1000bp Sngl1000bp Freq10000bp Rare10000bp Sngl10000bp
EnsembleRegulatoryFeature dbscSNV-ada_score dbscSNV-rf_score RemapOverlapTF RemapOverlapCL
RawScore PHRED

1 10001 T TC INS 1 RegulatoryFeature REGULATORY 4 regulatory 0.448933333333 0.00993288590604 NA
NA NA NA NA NA NA ENSR00000344265 NA NA NA NA NA NA NA NA NA NA NA NA NA 1869 3670 NA NA NA NA NA NA NA NA NA NA 994 NA NA NA NA 0.008 0.000 0.000 0.000 0.016 0.000 0.024 0.087 0.472 0.000 0.000 0.000 0.000 0.000 0.394 NA NA 0 0 NA NA
NA NA NA GM1 10.04 2.84 8.0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2773 NA NA NA NA NA NA 3 2 32 NA NA -0.083014 1.567
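As a sanity check, the header and every data row should split into the same number of tab-separated columns; one way to verify this (assuming tab delimiters throughout; it should print a single number):

# Print the distinct per-row column counts found in the file.
awk -F'\t' '{ print NF }' hg19/InDels_inclAnno.tsv | sort -nu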

The Funcotator command:

gatk Funcotator \
    --variant TriLevelv2_bqsr-filtered.vcf \
    --output test_cadd.vcf \
    --reference hg19.fa \
    --data-sources-path /funcotator_dataSources.v1.6.20190124s \
    --ref-version hg19 \
    --output-file-format VCF \
    --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' \
    --verbosity DEBUG \
    --disable-sequence-dictionary-validation true \
    --disable-bam-index-caching true
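To narrow things down, one test I can think of is pointing Funcotator at a stripped-down data-sources directory containing only the required gencode source plus the new cadd folder (a sketch; ds_test is just a hypothetical scratch directory name):

# Hypothetical isolation test: keep only gencode (required) and cadd,
# so any failure can be attributed to the new cadd source.
mkdir ds_test
cp -r funcotator_dataSources.v1.6.20190124s/gencode ds_test/
cp -r funcotator_dataSources.v1.6.20190124s/cadd ds_test/
gatk Funcotator \
    --variant TriLevelv2_bqsr-filtered.vcf \
    --output test_cadd_minimal.vcf \
    --reference hg19.fa \
    --data-sources-path ds_test \
    --ref-version hg19 \
    --output-file-format VCF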

I am not sure what I missed here, and I am not quite sure how new data sources should be added. I sincerely appreciate your help!
Issue filed on GitHub by bhanuGandham: #6223 (open)