VariantAnnotator error: java.lang.NumberFormatException: For input string: "."

Hi, I am going through the GATK Best Practices pipeline for the first time. I have non-human data and am at the variant discovery step in the pipeline. I don't have well-vetted 'known' variant sites, but I have pieced together a VCF of high-likelihood SNPs that are shared among several data sets from our lab over the past year. The VCF has only four basic annotations, and the software that produced it doesn't appear to offer a way to generate others, so I am trying to add a few with VariantAnnotator. My command is:
java -jar GenomeAnalysisTK.jar \
-R Oncorhynchus_mykiss_chr.fa \
-T VariantAnnotator \
-I Om2013all-rg.bam \
-A StrandOddsRatio \
-A MappingQualityRankSumTest \
-A ReadPosRankSumTest \
-o Om2013bElp.vcf \
-V Om2013bEl.vcf
When I run it I get an error:
ERROR stack trace
java.lang.NumberFormatException: For input string: "."
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at htsjdk.variant.variantcontext.GenotypeLikelihoods.parseDeprecatedGLString(GenotypeLikelihoods.java:251)
at htsjdk.variant.variantcontext.GenotypeLikelihoods.fromGLField(GenotypeLikelihoods.java:81)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:715)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:128)
at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:347)
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:279)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:257)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:60)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:41)
at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:502)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.<init>(IndexFactory.java:403)
at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:312)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:401)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:287)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:224)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:147)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.<init>(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.<init>(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1047)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:828)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:286)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: "."
ERROR ------------------------------------------------------------------------------------------
I've read everything I can find on VariantAnnotator on the GATK website but haven't found anything helpful yet. I am running
The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12.
I suspect that I am doing something wrong but can't track it down. Any ideas about what is or isn't going on?
Thanks,
Sewall
Best Answer
pdexheimer ✭✭✭✭
I suspect that even if these records were parsed without error, they would be grossly misinterpreted and probably throw a different error somewhere downstream.
GL is one of the standard FORMAT annotations defined in the spec. The spec defines it as having a Number of G (one entry per genotype), and that seems to be how it's used here as well. The spec also defines GL values as the log-likelihood for each genotype. The data posted above is in a strange range - if it were linear, I'd expect likelihoods to lie between 0 and 1. If it were log, it should always be negative. But the data above is positive and much larger than 1. Annotations from the spec are actually processed by htsjdk (iirc, the header definition is ignored and the standard one is used, with a warning), and IIRC it will internally convert GL values to PL values (i.e., Phred-scale and relativize them). The data as posted above really doesn't make sense, and I think that even if the dots were replaced by an arbitrary number (0?), problems would still crop up downstream.
I agree with Geraldine's conclusion about the dots. Every single record above seems to report a value only for the heterozygous genotype and to use the "missing value" for the two homozygous genotypes. I don't think the spec ever explicitly states this, but my feeling is that you can't have missing values as elements of an array: either the entire annotation is missing, or it's reported fully. That's clearly the way htsjdk interprets things...
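To make the range argument concrete, here is a minimal Python sketch of the conversion described above, assuming GL holds log10-scaled likelihoods as the VCF spec defines them. The function name gl_to_pl is hypothetical and this is not htsjdk's actual code; it only illustrates Phred-scaling (multiply by -10) and relativizing (subtract the smallest value so the most likely genotype gets 0).

```python
def gl_to_pl(gls):
    """Convert log10 genotype likelihoods (GL) to normalized Phred-scaled values (PL).

    Illustrative only: assumes one log10 likelihood per genotype, all <= 0.
    """
    if any(gl > 0 for gl in gls):
        # A log10 likelihood is the log of a probability-like value in (0, 1],
        # so it can never be positive.
        raise ValueError("log10 likelihoods cannot be positive: %r" % (gls,))
    raw = [-10.0 * gl for gl in gls]   # Phred-scale each value
    best = min(raw)                    # the most likely genotype has the smallest value
    return [int(round(p - best)) for p in raw]  # relativize so the best genotype is 0


print(gl_to_pl([-1.0, -2.5, -10.0]))  # [0, 15, 90]
```

Under this assumption, a value such as 76.25 from the records in this thread is rejected outright, which is the sense in which the posted GL data is out of range for either a linear or a log-scaled likelihood.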
Answers
My guess is that your VCF is malformed. Some annotation programs output VCFs that don't conform to the spec. Can you post the header lines showing the definitions of the annotations, and a few VCF records showing annotation values?
Geraldine
Thanks for the prompt response. Here is the header and a few lines of data.
fileformat=VCFv4.0
fileDate=20151003
source="Stacks v1.30"
INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 11DN0004-OmyWGS_2014 11DN0005-OmyWGS_2014 11DN0006-OmyWGS_2014 11DN0008-OmyWGS_2014 11DN0011-OmyWGS_2014 11DN0013-OmyWGS_2014 11DN0017-OmyWGS_2014 MYkALH12_F003-OmyWGS_2014 MYkALH12_M001-OmyWGS_2014 MYkALN12_F001-OmyWGS_2014 MYkALN12_F002-OmyWGS_2014 MYkALN12_M001-OmyWGS_2014 MYkALN12_M004-OmyWGS_2014 MYkALN12_M005-OmyWGS_2014 MyReit12x_0012-OmyWGS_2014 13BL0003-OmyWGS_2014 13BL0007-OmyWGS_2014 MyToku12x_0001-OmyWGS_2014 MyToku12x_0013-OmyWGS_2014
chrUn_1 19860 21518 C T . PASS NS=19;AF=0.974,0.026 GT:DP:AD:GL 0/0:55:55,0:.,76.25,. 0/0:38:38,0:.,52.68,. 0/0:44:44,0:.,61,. 0/0:33:33,0:.,45.75,. 0/0:27:27,0:.,37.43,. 0/0:37:37,0:.,51.29,. 1/0:62:36,26:.,85.95,. 0/0:42:42,0:.,58.22,. 0/0:34:34,0:.,47.13,. 0/0:31:31,0:.,42.98,. 0/0:34:34,0:.,47.13,. 0/0:15:15,0:.,20.79,. 0/0:25:25,0:.,34.66,. 0/0:40:40,0:.,55.45,. 0/0:22:22,0:.,30.5,. 0/0:41:41,0:.,56.84,. 0/0:51:51,0:.,70.7,. 0/0:25:25,0:.,34.66,. 0/0:12:12,0:.,16.64,.
chrUn_1 44097 23124 C A . PASS NS=19;AF=0.500,0.500 GT:DP:AD:GL 0/1:32:12,20:.,44.36,. 0/1:34:12,22:.,47.13,. 0/1:32:12,20:.,44.36,. 0/1:19:12,7:.,26.34,. 0/1:24:9,15:.,33.27,. 0/1:37:14,23:.,51.29,. 0/1:18:8,10:.,24.95,. 0/1:30:9,21:.,41.59,. 0/1:13:6,7:.,18.02,. 0/1:29:8,21:.,41.59,. 0/1:45:13,32:.,63.77,. 0/1:20:11,9:.,27.73,. 0/1:44:11,33:.,61,. 0/1:16:6,10:.,22.18,. 0/1:41:18,23:.,56.84,. 0/1:41:19,22:.,56.84,. 0/1:45:18,27:.,62.38,. 0/1:51:22,29:.,70.7,. 0/1:33:13,20:.,45.75,.
chrUn_1 44181 23123 G A . PASS NS=19;AF=0.921,0.079 GT:DP:AD:GL 0/0:44:44,0:.,61,. 0/0:38:38,0:.,52.68,. 0/0:38:25,0:.,54.07,. 0/0:30:30,0:.,41.59,. 0/0:29:12,0:.,40.2,. 1/0:34:13,21:.,48.52,. 0/0:37:37,0:.,51.29,. 0/0:42:42,0:.,58.22,. 0/0:34:34,0:.,47.13,. 0/0:39:24,0:.,54.07,. 0/0:34:15,0:.,47.13,. 0/0:26:7,0:.,36.04,. 1/0:30:15,15:.,41.59,. 1/0:41:23,18:.,56.84,. 0/0:40:19,0:.,55.45,. 0/0:43:22,0:.,59.61,. 0/0:45:45,0:.,62.38,. 0/0:31:31,0:.,42.98,. 0/0:24:24,0:.,33.27,.
chrUn_1 44204 23123 G A . PASS NS=19;AF=0.526,0.474 GT:DP:AD:GL 1/1:44:0,44:.,61,. 1/1:38:0,38:.,52.68,. 0/0:38:25,0:.,54.07,. 1/1:30:0,30:.,41.59,. 1/0:29:12,17:.,40.2,. 1/0:34:13,21:.,48.52,. 1/1:37:0,37:.,51.29,. 0/0:42:42,0:.,58.22,. 0/0:34:34,0:.,47.13,. 1/0:39:24,15:.,54.07,. 0/0:34:15,0:.,47.13,. 1/0:26:7,19:.,36.04,. 1/0:30:15,15:.,41.59,. 1/1:41:0,23:.,56.84,. 0/0:40:19,0:.,55.45,. 1/0:43:22,21:.,59.61,. 1/1:45:0,45:.,62.38,. 0/0:31:31,0:.,42.98,. 0/0:24:24,0:.,33.27,.
chrUn_1 44217 23123 C T . PASS NS=19;AF=0.789,0.211 GT:DP:AD:GL 0/0:44:44,0:.,61,. 0/0:38:38,0:.,52.68,. 0/1:38:13,25:.,54.07,. 0/0:30:30,0:.,41.59,. 0/1:29:17,12:.,40.2,. 0/1:34:21,13:.,48.52,. 0/0:37:37,0:.,51.29,. 0/0:42:42,0:.,58.22,. 0/0:34:34,0:.,47.13,. 0/0:39:24,0:.,54.07,. 0/1:34:19,15:.,47.13,. 0/1:26:19,7:.,36.04,. 0/0:30:15,0:.,41.59,. 0/0:41:23,0:.,56.84,. 0/1:40:21,19:.,55.45,. 0/0:43:22,0:.,59.61,. 0/0:45:45,0:.,62.38,. 1/1:31:0,31:.,42.98,. 0/0:24:24,0:.,33.27,.
It looks like the leading "##" got dropped when I pasted in the snippet. The header lines begin with "##" except the "# CHROM .." line.
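The pattern visible in the records above (each GL field holds one number flanked by two "." entries) can be checked for mechanically. The following is a rough Python sketch, not part of GATK or Stacks; the function name partially_missing_gl is hypothetical, and it splits on any whitespace because the records pasted here are space-separated, whereas a real VCF is tab-delimited.

```python
def partially_missing_gl(vcf_line):
    """Return the 0-based sample indexes whose GL array mixes '.' with numbers."""
    fields = vcf_line.rstrip("\n").split()
    format_keys = fields[8].split(":")      # column 9 is the FORMAT column
    if "GL" not in format_keys:
        return []
    gl_pos = format_keys.index("GL")
    flagged = []
    for i, sample in enumerate(fields[9:]):  # sample columns follow FORMAT
        parts = sample.split(":")
        if gl_pos >= len(parts):
            continue                         # GL not present for this sample
        values = parts[gl_pos].split(",")
        # A lone "." means the whole annotation is missing, which is fine;
        # a "." sitting next to real numbers is the problematic case.
        if "." in values and any(v != "." for v in values):
            flagged.append(i)
    return flagged


# First record from the post, shortened to its first two samples.
record = ("chrUn_1 19860 21518 C T . PASS NS=19;AF=0.974,0.026 GT:DP:AD:GL "
          "0/0:55:55,0:.,76.25,. 0/0:38:38,0:.,52.68,.")
print(partially_missing_gl(record))  # [0, 1]
```

Running something like this over the data lines of the file (skipping lines that start with "#") would show how many genotype fields carry a partly filled GL array.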
@syoung
Hi Sewall,
Can you tell me how you generated the input VCF?
Thanks,
Sheila
I'm guessing this was produced by Stacks, since the source line says source="Stacks v1.30".
The error is happening when we're trying to parse the GL field, which is defined by the program in the header as:
FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
The program is trying to parse a number but is finding a "." instead. I'm not sure whether this is a case of the field type being specified incorrectly in the header, or of "." not being allowed for a missing GL value. You can try filling in number values in one or the other and see whether one fixes the error.
Thanks to all of you for responding. Your explanations make sense. For my immediate needs I'm better off using hard-filtering.