VCF file malformed

Hi all. I am new to bioinformatics. I am trying to do base recalibration, which is why I need a dbsnp.vcf with the known variants for the genome I am working with. I downloaded the dbSNP files from NCBI:

I created a file with the URLs of all the chromosomes

cat > download-file-list.txt

Download all the files

wget -i download-file-list.txt

Unzip

gunzip -dk *.gz

Compress with bgzip

parallel bgzip {} ::: *.vcf

Index the files

parallel tabix -p vcf {} ::: *.vcf.gz

Concatenate the files into a single one

vcf-concat *.vcf.gz | gzip -c > Reference_dbsnp.vcf.gz

Then I validated the variants against the reference file

java -d64 -Xmx48g -jar /etc/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar \
-T ValidateVariants \
-R /home/mbxav/R-drive/Reference/HA1B25B/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa \
-V Reference_dbsnp.vcf.gz \
--validationTypeToExclude ALL

I got this error:

INFO 15:07:26,868 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:07:26,870 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41
INFO 15:07:26,871 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 15:07:26,871 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 15:07:26,875 HelpFormatter - Program Args: -T ValidateVariants -R /home/mbxav/R-drive/Reference/HA1B25B/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa -V Reference_dbsnp.vcf.gz --validationTypeToExclude ALL
INFO 15:07:26,878 HelpFormatter - Executing as mbxav@godzilla on Linux 3.13.0-101-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03.
INFO 15:07:26,879 HelpFormatter - Date/Time: 2016/12/22 15:07:26
INFO 15:07:26,879 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:07:26,879 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:07:31,624 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:07:35,414 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
WARN 15:07:35,875 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 15:07:35,943 GenomeAnalysisEngine - Preparing for traversal
INFO 15:07:35,966 GenomeAnalysisEngine - Done preparing for traversal
INFO 15:07:35,966 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 15:07:35,966 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 15:07:35,967 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 15:08:02,988 ValidateVariants - Reference allele is too long (202) at position 1:55374364; skipping that record.
INFO 15:08:05,969 ProgressMeter - 1:60995549 1341694.0 30.0 s 22.0 s 5.0% 10.1 m 9.6 m
INFO 15:08:35,971 ProgressMeter - 1:109997929 2403623.0 60.0 s 24.0 s 8.9% 11.2 m 10.2 m
INFO 15:09:05,972 ProgressMeter - 1:154428030 3343004.0 90.0 s 26.0 s 12.6% 11.9 m 10.4 m
INFO 15:09:35,973 ProgressMeter - 2:19998392 4777046.0 120.0 s 25.0 s 17.6% 11.4 m 9.4 m
INFO 15:10:02,192 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.4-0-g7e26428):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: The provided VCF file is malformed at approximately line number 5465574: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "RSPOS=67905532;GENEINFO=101749622:SERPINB10 CPOX|420895:SERPINB6;dbSNPBuildID=138;SAO=0;VC=snp;VLD;VP=0500000C0005000000000100", for input source: /home/mbxav/R-drive/Reference/HA1A22A/dbsnp/files/Reference_dbsnp.vcf.gz
ERROR ------------------------------------------------------------------------------------------

I tried to fix the whitespace error with this:

sed -i 's/^RSPOS=67905532;GENEINFO=101749622:SERPINB10 CPOX|420895:SERPINB6;dbSNPBuildID=138;SAO=0;VC=snp;VLD;VP=0500000C0005000000000100/RSPOS=67905532;GENEINFO=101749622:SERPINB10CPOX|420895:SERPINB6;dbSNPBuildID=138;SAO=0;VC=snp;VLD;VP=0500000C0005000000000100/' Reference_dbsnp.vcf.gz > test.vcf.gz

Unfortunately, when I tried to validate the variants again, I got the same error at the same position, even after reindexing the output file.
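Before patching a single record, it can help to check how many records are affected. The following is a minimal sketch, not part of the original post: `find_info_whitespace` is a hypothetical helper name, and it assumes the file is gzip-compressed and tab-delimited with INFO in column 8, as in the standard VCF layout. It prints the line number and INFO value of every record whose INFO field contains a space.

```shell
# Hypothetical helper (name is an assumption, not from any toolkit):
# list records whose INFO column (column 8) contains a space.
# Header lines (starting with '#') are skipped, but still counted in NR,
# so the printed number is the line number within the decompressed file.
find_info_whitespace() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ && $8 ~ / / { print NR ": " $8 }'
}
```

For example, `find_info_whitespace Reference_dbsnp.vcf.gz | head` would show the first few offending records, if there are any.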

If anyone has any idea how to fix this, I would really appreciate it. Thanks.

Answers

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @adriana_v,

    My guess is that the problem stems from the step where you concatenate the files into a single one. I'm unfamiliar with vcf-concat. I would recommend that you instead use either Picard's GatherVcfs or MergeVcfs. The choice between the two depends on what is in the VCF headers. I think this should solve your issue. If not, let us know. Happy holidays.
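If the whitespace is still present after re-merging, one possible workaround is to clean the INFO column directly. The sed attempt in the question most likely had no effect for three reasons: the `^` anchors the pattern to the start of the line, while INFO is the eighth column; sed was run on the gzip-compressed bytes rather than on the text; and with `-i` sed edits in place and writes nothing to standard output, so the redirected test.vcf.gz would be empty. The sketch below is an assumption-laden illustration, not a method confirmed in this thread: `fix_info_whitespace` is a hypothetical name, replacing spaces with underscores is an arbitrary choice, and it assumes that altering GENEINFO values is acceptable for the downstream analysis.

```shell
# Hypothetical helper (name and underscore replacement are assumptions):
# decompress a gzipped VCF, replace spaces in the INFO column (column 8)
# with underscores, and write a new gzipped file.
# Header lines (starting with '#') are passed through unchanged.
fix_info_whitespace() {
    gzip -dc "$1" \
      | awk 'BEGIN { FS = OFS = "\t" }
             !/^#/ && NF >= 8 { gsub(/ /, "_", $8) }
             { print }' \
      | gzip -c > "$2"
}
```

A call might look like `fix_info_whitespace Reference_dbsnp.vcf.gz Reference_dbsnp.fixed.vcf.gz`. Because tabix requires BGZF compression, the final `gzip -c` would need to be swapped for `bgzip -c` before indexing the result.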
