ValidateVariants

Matthias Posts: 1 Member
edited July 2012 in Ask the GATK team

doesn't check whether the header conforms


By mistake I just added some comments to the header like

commtent text bla bla...

and ValidateVariants (1.6.2) confirmed that it is a valid VCF, but vcftools complained about it.

BTW: how do I write comments? Do I have to use

comment= comment text

??

Answers

  • ebanks Posts: 683 GATK Developer mod

    We just throw away invalid header lines. You should definitely use vcftools to test whether a VCF file conforms to the specification. ValidateVariants tests the data records for things that the spec allows but that are inherently incorrect (e.g. the reference base is incorrect given the genome build, or the wrong dbSNP ID is present).

    For information about creating valid VCF files I refer you to the VCF spec itself where you should find everything you need.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT
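
Editorial aside: a minimal Python sketch of the kind of header check that ValidateVariants skips. It assumes the VCF convention that every meta-information line before the `#CHROM` line has the form `##key=value`. The function name and the example lines (including the leading `##` on the comment lines) are illustrative assumptions, and this is not a substitute for vcf-validator.

```python
import re

# A meta-information line is expected to look like "##key=value".
META_LINE = re.compile(r"^##[^=\s]+=.+$")


def nonconforming_header_lines(lines):
    """Return (line_number, text) for header lines that are not '##key=value'.

    Scanning stops at the '#CHROM' column-header line.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("#CHROM"):
            break
        if not META_LINE.match(line):
            problems.append((number, line))
    return problems


# Hypothetical header: the second line is free text without key=value.
example = [
    "##fileformat=VCFv4.1",
    "##commtent text bla bla...",
    "##comment=comment text",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(nonconforming_header_lines(example))
```

Under that convention, a free-text comment would have to be written in `##key=value` form, as in the third example line.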

  • Lavinia Posts: 37 Member

    I have one VCF which I think is valid (it is from the Sanger) but it is failing ValidateVariants (following on from an older post I made on VariantRecalibrator). Running:

    java -Xmx2g -jar /mnt/storage/system/usr/local/GenomeAnalysisTK-2.3-9/GenomeAnalysisTK.jar -R /mnt/storage/shared/genomes/GRCm38/GRCm38_68.fa -T ValidateVariants --variant C57BL6NJ.v3.snps.sorted.vcf

    ValidateVariants gives me:

    WARN 13:55:01,994 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 3875 seconds. Retrying connection.
    INFO 13:55:02,706 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.NumberFormatException: For input string: "."
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:481)
        at java.lang.Integer.valueOf(Integer.java:582)
        at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec.decodeInts(AbstractVCFCodec.java:680)
        at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:641)
        at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:92)
        at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:130)
        at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:120)
        at org.broadinstitute.sting.utils.variantcontext.GenotypesContext.iterator(GenotypesContext.java:461)
        at org.broadinstitute.sting.utils.variantcontext.VariantContext.validateAlternateAlleles(VariantContext.java:1063)
        at org.broadinstitute.sting.utils.variantcontext.VariantContext.extraStrictValidation(VariantContext.java:1032)
        at org.broadinstitute.sting.gatk.walkers.variantutils.ValidateVariants.validate(ValidateVariants.java:158)
        at org.broadinstitute.sting.gatk.walkers.variantutils.ValidateVariants.map(ValidateVariants.java:115)
        at org.broadinstitute.sting.gatk.walkers.variantutils.ValidateVariants.map(ValidateVariants.java:75)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:243)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:231)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:120)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:67)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:23)
        at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
    ERROR
    ERROR Please visit the wiki to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: For input string: "."
    ERROR ------------------------------------------------------------------------------------------

    I'm unable to update our version of GATK at the moment. I am running vcftools vcf-validator on the VCF, and so far it has given me:

    INFO field at 1:3049306 .. INFO tag [CSQ] not listed in the header

    I'll update this comment if vcftools gives any further information on the file. Any tips greatly appreciated. Thanks, Lavinia.

  • Geraldine_VdAuwera Posts: 6,461 Administrator, GATK Developer admin

    Difficult to say based on this but it looks like there is a missing value that is not formatted the way it should be. The vcf validator tool should give you more details when it reaches the offending site. If you want to try to narrow down the problem with GATK in the meantime, you can try running ValidateVariants with -l DEBUG. That will give you more information on which interval the error is located in.

    Geraldine Van der Auwera, PhD
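
Editorial aside: the stack trace above shows an integer parse failing on "." while decoding genotype fields (`decodeInts` called from `createGenotypeMap`). One plausible candidate, and only a guess, is a comma-separated integer field such as PL in which some entries are "." (a record posted later in this thread has PL values like `0,.,.`). A minimal Python sketch for spotting such samples, assuming tab-separated records; the function name is hypothetical:

```python
def samples_with_partial_missing(record_line, field="PL"):
    """Return (sample_index, value) pairs where `field` is a comma-separated
    list containing at least one '.' entry.

    Assumes a tab-separated VCF data line with a FORMAT column (column 9)
    followed by one column per sample.
    """
    cols = record_line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return []  # no genotype columns on this line
    keys = cols[8].split(":")
    if field not in keys:
        return []
    idx = keys.index(field)
    hits = []
    for sample_index, sample in enumerate(cols[9:]):
        values = sample.split(":")
        if idx >= len(values):
            continue  # trailing fields may be omitted for a sample
        parts = values[idx].split(",")
        if len(parts) > 1 and "." in parts:
            hits.append((sample_index, values[idx]))
    return hits


# Shortened, illustrative record with two samples.
record = "\t".join([
    "1", "3050087", ".", "C", "T", "246.28", "PASS", "DP=546",
    "GT:GQ:DP:SP:PL:FI",
    "0/0:.:25:0:0,.,.:1",
    "1/1:60:18:0:59,0,14:1",
])
print(samples_with_partial_missing(record))
```

Whether such values are what actually triggers the exception in GATK 2.3 is not confirmed anywhere in this thread.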

  • Lavinia Posts: 37 Member

    Hi Geraldine, thanks for the debug tip. I have to say most of the output is incomprehensible to me, but the bits that look useful are:

    DEBUG 07:47:10,103 GenomeLocParser - JH584299.1 (953012 bp)
    INFO 07:47:10,106 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 07:47:10,106 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    DEBUG 07:47:10,236 TraverseLociNano - TraverseLoci.traverse: Shard is 1:1-1000000
    DEBUG 07:47:10,249 TraverseLociNano - TraverseLoci.traverse: Shard is 1:1000001-2000000
    DEBUG 07:47:10,249 TraverseLociNano - TraverseLoci.traverse: Shard is 1:2000001-3000000
    DEBUG 07:47:10,250 TraverseLociNano - TraverseLoci.traverse: Shard is 1:3000001-4000000
    DEBUG 07:47:10,361 GATKRunReport - Aggregating data for run report

    .. then a bit about Amazon http stuff then bang, same error again. I'll do a bit more tinkering but I think I might just not worry about using ValidateVariants. Thanks very much for your help.
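
Editorial aside: the useful part of this DEBUG output is the last "Shard is" line printed before the crash. A minimal Python sketch that pulls it out of a saved log, assuming the log format shown in this post; the function name is hypothetical:

```python
import re

# Matches e.g. "TraverseLoci.traverse: Shard is 1:3000001-4000000"
SHARD = re.compile(r"Shard is (\S+):(\d+)-(\d+)")


def last_shard(log_text):
    """Return (contig, start, end) of the last shard mentioned in the log,
    or None if no shard line is present."""
    matches = SHARD.findall(log_text)
    if not matches:
        return None
    contig, start, end = matches[-1]
    return contig, int(start), int(end)


log = """\
DEBUG 07:47:10,249 TraverseLociNano - TraverseLoci.traverse: Shard is 1:2000001-3000000
DEBUG 07:47:10,250 TraverseLociNano - TraverseLoci.traverse: Shard is 1:3000001-4000000
DEBUG 07:47:10,361 GATKRunReport - Aggregating data for run report
"""
print(last_shard(log))
```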

  • Lavinia Posts: 37 Member

    Ok, I am going to have to give up on ValidateVariants. For the record, I was using the VCF file mgp.v3.snps.rsIDdbSNPv137.vcf from the Sanger mouse resources (ftp://ftp-mouse.sanger.ac.uk/current_snps/) with their reference (ftp://ftp-mouse.sanger.ac.uk/ref/), sorted with vcftools because the order of the VCF doesn't agree with the reference. Thanks.

  • Geraldine_VdAuwera Posts: 6,461 Administrator, GATK Developer admin

    Hi Lavinia,

    Did you have any success at all running the vcftools validation tool?

    FYI, based on the debug info you posted, it looks like your problem is located somewhere on chromosome 1, between positions 3000001 and 4000000. You may run into issues using this VCF as input to other GATK tools, so if you do, that interval should be the first place to look.

    Don't worry about the Amazon/http stuff; that's just the automatic reporting (phone home feature), and even if it fails that won't impact your run. If you find that it's always failing, there may be something like a firewall that prevents the reporting system from communicating with the cloud service we use to collect the reports (or your machine is not connected to the internet). If so, you can request a key to deactivate it completely.

    Geraldine Van der Auwera, PhD
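
Editorial aside: to look at that interval by hand, the records that fall inside it can be pulled out of the VCF. A minimal Python sketch, assuming an uncompressed, tab-separated VCF already read into a list of lines; the function name and the sample lines are hypothetical:

```python
def records_in_interval(lines, contig, start, end):
    """Yield VCF data lines whose CHROM equals `contig` and whose POS lies
    within [start, end] (1-based, inclusive). Header lines are skipped."""
    for raw in lines:
        if raw.startswith("#"):
            continue
        fields = raw.split("\t", 2)
        if len(fields) < 2:
            continue  # not a data line
        if fields[0] == contig and start <= int(fields[1]) <= end:
            yield raw.rstrip("\n")


# Illustrative lines only; real records carry many more columns.
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t2999999\t.\tA\tG\t50\tPASS\tDP=10\n",
    "1\t3049306\t.\tC\tT\t60\tPASS\tDP=12\n",
    "2\t3050000\t.\tG\tA\t70\tPASS\tDP=14\n",
]
for record in records_in_interval(vcf_lines, "1", 3000001, 4000000):
    print(record)
```

For a real file, passing an open file handle instead of a list should work the same way, since the function only iterates over lines.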

  • Lavinia Posts: 37 Member

    Hi, vcftools ran, with the only output the one I listed above, "INFO field at 1:3049306 .. INFO tag [CSQ]". Thanks for pointing out the possible locations, I'll see if I can narrow it down. Thanks.

  • amberlin Broad Institute Posts: 6 Member

    Hi Lavinia,

    I was wondering if you ever solved your issue? I am using the same file from Sanger and have the same cryptic error. Thanks in advance for any help you can provide.

    Aaron

  • Lavinia Posts: 37 Member

    Hi Aaron, no I didn't. That was with an older version of GATK and I have just updated to the most recent version. I hope to revisit this project and will have another go at this VCF file; I have more experience with the data now, so I've got a better idea of what to look for. I'll post/let you know if I can narrow down the problem.

  • Bettina_Harr Posts: 22 Member

    I am having the same issue with the tool "SelectVariants".

    The command line looks like this:

    java -Xmx2g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -R /usr/users/bharr/ILLUMINA/Mus_musculus.GRCm38.74.dna.chromosome.fa -T SelectVariants --variant mgp.v3.snps.rsIDdbSNPv137.vcf -o WSBEiJ.vcf -sn WSBEiJ -env -ef

    The error is this:

    ERROR MESSAGE: Key CSQ found in VariantContext field INFO at 1:3050087 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

    The CSQ tag is for sure defined in the vcf file:

    INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT alleles from Ensembl 70 VEP v2.8, format transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">

    The SNP with the CSQ tag that "SelectVariants" is complaining about looks like this:

    1 3050087 . C T 246.28 PASS AC1=0;AC=2;AF1=0;AN=36;DP4=47,397,1,14;DP=546;MDV=8;MQ=52;MSD=26;PV0=0.15;PV1=4.3e-05;PV2=0.02;PV3=0.049;PV4=0.15,4.3e-05,0.02,0.049;QD=0.2341;SB=0.9583;VDB=0.0106;CSQ=ENSMUST00000160944:ENSMUSG00000090025:upstream_gene_variant:Allele,T:Gene,Gm16088 GT:GQ:DP:SP:PL:FI 0/0:.:25:0:0,.,.:1 0/0:.:38:0:0,.,.:1 0/0:.:10:0:0,.,.:1 0/0:.:24:0:0,.,.:1 0/0:.:28:0:0,.,.:1 0/0:.:25:0:0,.,.:1 0/0:.:27:0:0,.,.:1 0/0:.:21:0:0,.,.:1 0/0:.:26:8:0,.,.:1 0/0:.:35:0:0,.,.:1 0/0:.:21:0:0,.,.:1 0/0:.:32:0:0,.,.:1 0/0:.:20:0:0,.,.:1 0/0:.:15:0:0,.,.:1 0/0:.:29:0:0,.,.:1 0/0:.:23:0:0,.,.:1 0/0:.:42:0:0,.,.:1 1/1:60:18:0:59,0,14:1

    Is there anything that can be done about that?

    THANKS

    Bettina
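
Editorial aside: a naive cross-check of INFO keys against the header can show whether an ID appears in an `##INFO` line at all, independently of how GATK parses that line. A minimal Python sketch, assuming tab-separated records and `##INFO=<ID=...>` header lines; the function name and the sample lines are hypothetical:

```python
import re

# Captures the ID from a header line such as ##INFO=<ID=CSQ,Number=.,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def undefined_info_keys(lines):
    """Return the set of INFO keys used in data lines that have no
    matching ##INFO=<ID=...> line earlier in the input."""
    defined = set()
    undefined = set()
    for raw in lines:
        line = raw.rstrip("\n")
        match = INFO_ID.match(line)
        if match:
            defined.add(match.group(1))
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) < 8 or cols[7] == ".":
                continue  # no INFO column, or INFO is empty
            for entry in cols[7].split(";"):
                key = entry.split("=", 1)[0]
                if key and key not in defined:
                    undefined.add(key)
    return undefined


# Illustrative input: CSQ is defined (with brackets in its description),
# XX is not defined anywhere.
sample = [
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">',
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence [optional]">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t3050087\t.\tC\tT\t246.28\tPASS\tAC=2;CSQ=x:y;XX=1",
]
print(undefined_info_keys(sample))
```

If a key such as CSQ is not reported here but GATK still rejects it, that would suggest the problem lies in how the header line is parsed and not in a missing definition.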

  • Geraldine_VdAuwera Posts: 6,461 Administrator, GATK Developer admin

    Hi @Bettina_Harr,

    The parser might be choking on some characters in the header. E.g. if you have square brackets anywhere in the definitions (e.g. in a FILTER field to indicate "greater than") that can mess up parsing for following lines -- I think I remember seeing that before. If that's all there is you can either edit the definitions or run with -U LENIENT_VCF_PROCESSING to allow the program to disregard the error.

    Geraldine Van der Auwera, PhD
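
Editorial aside: if square brackets in header definitions are the suspect, they can be located and, on a copy of the file, replaced before rerunning. A minimal Python sketch; both function names are hypothetical, and whether this edit resolves the GATK error is untested here:

```python
def header_lines_with_brackets(lines):
    """Return (line_number, text) for '##' header lines containing '[' or ']'."""
    found = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("##") and ("[" in line or "]" in line):
            found.append((number, line))
    return found


def replace_brackets(line):
    """Swap square brackets for parentheses in a '##' header line.
    Data lines and the '#CHROM' line are returned unchanged."""
    if not line.startswith("##"):
        return line
    return line.translate(str.maketrans("[]", "()"))


# Shortened, illustrative header line with brackets in the Description.
header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="format a:b[:c]+...[+d]">',
]
print(header_lines_with_brackets(header))
print(replace_brackets(header[1]))
```

Editing only the header text leaves the data records untouched, which keeps the change easy to review against the original file.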
