The current GATK version is 3.6-0

# ValidateVariants

Member Posts: 1
edited July 2012

doesn't check the header if it conforms

## commtent text bla bla...

and ValidateVariants(1.6.2) confirmed that it is valid vcf.

BTW:
how do I write comments? Do I have to use

## comment= comment text

??

We just throw away invalid header lines. You should definitely be using vcftools to test whether a VCF file conforms to the specification. ValidateVariants is used to test the data records for things that are allowed by the spec but are inherently incorrect (e.g. the reference base is incorrect given the genome build, or the wrong dbSNP ID is present).

For information about creating valid VCF files I refer you to the VCF spec itself where you should find everything you need.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 37

I have one VCF which I think is valid (it is from the Sanger) but it is failing ValidateVariants (following on from an older post I made on VariantRecalibrator). Running:
java -Xmx2g -jar /mnt/storage/system/usr/local/GenomeAnalysisTK-2.3-9/GenomeAnalysisTK.jar -R /mnt/storage/shared/genomes/GRCm38/GRCm38_68.fa -T ValidateVariants --variant C57BL6NJ.v3.snps.sorted.vcf
ValidateVariants gives me:
WARN 13:55:01,994 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 3875 seconds. Retrying connection.
INFO 13:55:02,706 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.NumberFormatException: For input string: "."
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.valueOf(Integer.java:582)
at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:92)
at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:130)
at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:120)
at org.broadinstitute.sting.utils.variantcontext.GenotypesContext.iterator(GenotypesContext.java:461)
at org.broadinstitute.sting.utils.variantcontext.VariantContext.validateAlternateAlleles(VariantContext.java:1063)
at org.broadinstitute.sting.utils.variantcontext.VariantContext.extraStrictValidation(VariantContext.java:1032)
at org.broadinstitute.sting.gatk.walkers.variantutils.ValidateVariants.validate(ValidateVariants.java:158)
at org.broadinstitute.sting.gatk.walkers.variantutils.ValidateVariants.map(ValidateVariants.java:115)
at org.broadinstitute.sting.gatk.walkers.variantutils.ValidateVariants.map(ValidateVariants.java:75)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:243)

##### ERROR ------------------------------------------------------------------------------------------

I'm unable to update our version of GATK at the moment. I am currently running vcftools vcf-validator on the VCF, and so far it has given me:
INFO field at 1:3049306 .. INFO tag [CSQ] not listed in the header
I'll update this comment if vcftools gives any further information on the file.
Any tips greatly appreciated,
thanks
Lavinia.

Difficult to say based on this but it looks like there is a missing value that is not formatted the way it should be. The vcf validator tool should give you more details when it reaches the offending site. If you want to try to narrow down the problem with GATK in the meantime, you can try running ValidateVariants with -l DEBUG. That will give you more information on which interval the error is located in.
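One way to hunt for a badly formatted missing value is to scan the genotype columns for comma-separated lists that mix numbers with `.`, such as the `0,.,.` PL values in the record quoted further down this thread. The sketch below is a minimal, hypothetical helper (the function name is made up for illustration). It assumes tab-separated VCF records, and it is only an assumption that values of this kind are what trips the parser in this file.

```python
def find_mixed_missing(record_line):
    """Return (sample_index, format_key, value) for every per-sample list value
    that mixes '.' with real entries, e.g. PL = "0,.,.".

    Hypothetical helper; assumes a tab-separated VCF data record.
    """
    cols = record_line.rstrip("\n").split("\t")
    if len(cols) < 10:
        # No FORMAT/sample columns, so nothing to check.
        return []
    keys = cols[8].split(":")
    hits = []
    for index, sample in enumerate(cols[9:]):
        for key, value in zip(keys, sample.split(":")):
            parts = value.split(",")
            # Flag lists such as "0,.,." but not a lone "." or "59,0,14".
            if len(parts) > 1 and "." in parts and any(p != "." for p in parts):
                hits.append((index, key, value))
    return hits
```

Running this over each non-header line and printing the CHROM and POS columns of any record with hits gives a short list of sites to inspect by hand.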

Geraldine Van der Auwera, PhD

• Member Posts: 37

Hi Geraldine, thanks for the debug tip. I have to say most of the output is incomprehensible to me, but the bits that look useful are:
DEBUG 07:47:10,103 GenomeLocParser - JH584299.1 (953012 bp)
INFO 07:47:10,106 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 07:47:10,106 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
DEBUG 07:47:10,236 TraverseLociNano - TraverseLoci.traverse: Shard is 1:1-1000000
DEBUG 07:47:10,249 TraverseLociNano - TraverseLoci.traverse: Shard is 1:1000001-2000000
DEBUG 07:47:10,249 TraverseLociNano - TraverseLoci.traverse: Shard is 1:2000001-3000000
DEBUG 07:47:10,250 TraverseLociNano - TraverseLoci.traverse: Shard is 1:3000001-4000000
DEBUG 07:47:10,361 GATKRunReport - Aggregating data for run report
..
then a bit about Amazon http stuff then bang, same error again.
I'll do a bit more tinkering but I think I might just not worry about using ValidateVariants. Thanks very much for your help.

• Member Posts: 37

Ok, I am going to have to give up on ValidateVariants. For the record, I was using the VCF file mgp.v3.snps.rsIDdbSNPv137.vcf from the Sanger mouse resources (ftp://ftp-mouse.sanger.ac.uk/current_snps/) with their reference (ftp://ftp-mouse.sanger.ac.uk/ref/), sorted with vcftools because the order of the VCF doesn't agree with the reference. Thanks.

Hi Lavinia,

Did you have any success at all running the vcftools validation tool?

FYI, based on the debug info you posted, it looks like your problem is located somewhere on chromosome 1, between positions 3000001 and 4000000. You may run into issues using this VCF as input to other GATK tools, so if you do, that interval should be the first place to look.
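To look at just that shard, the records can be pulled out with a few lines of Python. This is a rough sketch, not a GATK feature: `records_in_interval` is a hypothetical helper name, and it assumes a plain-text, tab-separated VCF.

```python
def records_in_interval(lines, chrom, start, end):
    """Yield (line_number, line) for data records on `chrom` with start <= POS <= end.

    Hypothetical helper; header lines (starting with '#') are skipped.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue
        cols = line.split("\t", 2)
        if len(cols) < 2 or not cols[1].isdigit():
            # Not a well-formed CHROM/POS pair; ignore it here.
            continue
        if cols[0] == chrom and start <= int(cols[1]) <= end:
            yield number, line.rstrip("\n")

# Example use with the file name from the earlier post:
# with open("C57BL6NJ.v3.snps.sorted.vcf") as handle:
#     for number, record in records_in_interval(handle, "1", 3000001, 4000000):
#         print(number, record[:80])
```

The line numbers make it easier to open the file at the right place once a suspicious record turns up.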

Don't worry about the Amazon/ http stuff, that's just the automatic reporting (phone home feature); even if it fails that won't impact your run. If you find that it's always failing, there may be something like a firewall that prevents the reporting system from communicating with the cloud service we use to collect the reports (or your machine is not connected to the internet). If so you can request a key to deactivate it completely.

Geraldine Van der Auwera, PhD

• Member Posts: 37

Hi, vcftools ran, with the only output the one I listed above, "INFO field at 1:3049306 .. INFO tag [CSQ]". Thanks for pointing out the possible locations, I'll see if I can narrow it down. Thanks.

Hi Lavinia,

I was wondering if you ever solved your issue? I am using the same file from Sanger and have the same cryptic error. Thanks in advance for any help you can provide.

Aaron

• Member Posts: 37

Hi Aaron, no I didn't. That was with an older version of GATK and I have just updated to the most recent version. I hope to revisit this project and will have another go at this VCF file. I have more experience with the data now, so I've got a better idea of what to look for. I'll post here if I can narrow down the problem.

• Member Posts: 27

I am having the same issue with the tool "SelectVariants".

The command line looks like this:
java -Xmx2g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -R /usr/users/bharr/ILLUMINA/Mus_musculus.GRCm38.74.dna.chromosome.fa -T SelectVariants --variant mgp.v3.snps.rsIDdbSNPv137.vcf -o WSBEiJ.vcf -sn WSBEiJ -env -ef

The error is this:

##### ERROR MESSAGE: Key CSQ found in VariantContext field INFO at 1:3050087 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

The CSQ tag is for sure defined in the vcf file:

## INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT alleles from Ensembl 70 VEP v2.8, format transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">

The SNP with the CSQ tag that "SelectVariants" is complaining about looks like this:

1 3050087 . C T 246.28 PASS AC1=0;AC=2;AF1=0;AN=36;DP4=47,397,1,14;DP=546;MDV=8;MQ=52;MSD=26;PV0=0.15;PV1=4.3e-05;PV2=0.02;PV3=0.049;PV4=0.15,4.3e-05,0.02,0.049;QD=0.2341;SB=0.9583;VDB=0.0106;CSQ=ENSMUST00000160944:ENSMUSG00000090025:upstream_gene_variant:Allele,T:Gene,Gm16088 GT:GQ:DP:SP:PL:FI 0/0:.:25:0:0,.,.:1 0/0:.:38:0:0,.,.:1 0/0:.:10:0:0,.,.:1 0/0:.:24:0:0,.,.:1 0/0:.:28:0:0,.,.:1 0/0:.:25:0:0,.,.:1 0/0:.:27:0:0,.,.:1 0/0:.:21:0:0,.,.:1 0/0:.:26:8:0,.,.:1 0/0:.:35:0:0,.,.:1 0/0:.:21:0:0,.,.:1 0/0:.:32:0:0,.,.:1 0/0:.:20:0:0,.,.:1 0/0:.:15:0:0,.,.:1 0/0:.:29:0:0,.,.:1 0/0:.:23:0:0,.,.:1 0/0:.:42:0:0,.,.:1 1/1:60:18:0:59,0,14:1

Is there anything that can be done about that?

THANKS

Bettina

The parser might be choking on some characters in the header. E.g. if you have square brackets anywhere in the definitions (e.g. in a FILTER field to indicate "greater than") that can mess up parsing for following lines -- I think I remember seeing that before. If that's all there is you can either edit the definitions or run with -U LENIENT_VCF_PROCESSING to allow the program to disregard the error.
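To find which definitions might need editing, a small script can list header lines whose Description contains bracket characters. The sketch below is a hypothetical helper: the regular expression is a simplification, not GATK's own parser, and it assumes header lines of the form `##INFO=<ID=...,Description="...">`.

```python
import re

# Simplified pattern for INFO/FILTER/FORMAT definitions; an assumption, not GATK's parser.
HEADER_DEF = re.compile(r'^##(INFO|FILTER|FORMAT)=<ID=([^,>]+).*?Description="(.*)">\s*$')

def definitions_with_brackets(header_lines):
    """Return (kind, id, brackets_found) for definitions whose Description
    contains any of the characters [ ] < >.
    """
    flagged = []
    for line in header_lines:
        match = HEADER_DEF.match(line)
        if not match:
            continue
        kind, ident, description = match.groups()
        found = sorted(set(description) & set("[]<>"))
        if found:
            flagged.append((kind, ident, "".join(found)))
    return flagged
```

Applied to the CSQ definition quoted above, this would flag the square brackets in its Description, which makes that line a reasonable first candidate to edit.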