The provided VCF file is malformed at...

naarkhoo Posts: 38 Member

My question may look like the one asked here, but the answer there didn't help me.

I am running VariantFiltration on a VCF file that was generated directly by UnifiedGenotyper under GenomeAnalysisTK-2.3-9-ge5ebf34.

The error I am facing is:

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 126: there aren't enough columns for line 70 (we expected 9 tokens, and saw 1 )

Line number 126 is as follows:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  m1016ROUa.40287 m1023ROGa.40244 m1042ujba.40261 m1069FXFa.49470

It is in fact the header line of the VCF file! Should I re-run my samples?
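A note for readers hitting the same message: VCF is a tab-delimited format, so a parser that splits on tabs will report a single token for a line whose tabs were turned into spaces (for example by an editor or a copy/paste step). "Saw 1" token on the header line is consistent with that, although the thread does not confirm it was the cause here. The sketch below compares the two ways of counting fields; the function name `count_tokens` and the sample name `sample1` are made up for illustration.

```python
def count_tokens(line):
    """Count the fields of one VCF line under two splitting rules."""
    stripped = line.rstrip("\r\n")
    return {
        "tab": len(stripped.split("\t")),     # fields when splitting strictly on tabs
        "whitespace": len(stripped.split()),  # fields when splitting on any whitespace
    }


# Hypothetical header lines: same text, different delimiters.
columns = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "sample1"]
tab_header = "\t".join(columns)
space_header = " ".join(columns)

print(count_tokens(tab_header))
print(count_tokens(space_header))
```

If the two counts differ for the `#CHROM` line of a real file, the tabs were probably lost somewhere between the caller and the tool that rejects the file.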

Answers

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi there,

    Did you do any manipulations on the file after it was made by UG, before you got the error? UG shouldn't be generating bad VCF files.

    If you can't find the source of the error, I suggest you run your samples again with the new version (2.4).

    Geraldine Van der Auwera, PhD

  • tommycarstensen Posts: 112 Member
    edited May 2013

    I have the same problem, except when using CombineVariants 2.4 to merge 4 VCFs generated with UnifiedGenotyper 2.4:

    ERROR MESSAGE: Line 4623: there aren't enough columns for line

    I haven't experienced this problem with other VCF files, i.e. other VCF files generated in exactly the same way for the same samples but for different chromosome fragments. I also tried regenerating the 4 VCF files in question.

    This is what line 4623 looks like in each of those 4 VCFs:

    12 50240233 . A . 0.08 LowQual AN=122;DP=174;MQ=51.69;MQ0=0 GT:DP 0/0:1 0/0:4 0/0:3 0/0:1 0/0:1 0/0:2 0/0:1 0/0:5 0/0:2 0/0:1 0/0:3 0/0:2 ./. 0/0:1 0/0:1 0/0:3 0/0:1 0/0:3 0/0:1 0/0:2 0/0:1 0/0:2 0/0:1 0/0:1 0/0:3 0/0:2 0/0:2 0/0:2 0/0:4 0/0:1 0/0:6 0/0:6 0/0:3 ./. 0/0:6 0/0:3 0/0:3 0/0:3 0/0:3 ./. 0/0:3 0/0:4 0/0:2 0/0:2 0/0:4 0/0:2 ./. ./. 0/0:5 0/0:2 0/0:3 0/0:3 0/0:4 0/0:1 0/0:4 0/0:2 ./. 0/0:2 0/0:3 0/0:2 0/0:3 0/0:5 0/0:8 0/0:6 0/0:2 0/0:4 0/0:4

    12 50233207 . G . 33.96 . AN=198;DP=544;MQ=58.53;MQ0=0 GT:DP 0/0:3 0/0:8 0/0:10 0/0:9 0/0:4 0/0:6 0/0:4 0/0:3 0/0:4 0/0:8 0/0:6 0/0:5 0/0:7 0/0:3 0/0:8 0/0:2 0/0:3 0/0:2 0/0:3 0/0:4 0/0:8 0/0:2 0/0:5 0/0:3 0/0:3 0/0:6 0/0:9 0/0:6 0/0:2 0/0:4 0/0:10 0/0:6 0/0:1 0/0:7 0/0:9 0/0:6 0/0:8 0/0:5 0/0:3 ./. 0/0:4 0/0:6 0/0:5 0/0:2 0/0:8 0/0:4 0/0:6 0/0:5 0/0:9 0/0:5 0/0:6 0/0:9 0/0:3 0/0:5 0/0:4 0/0:6 0/0:10 0/0:6 0/0:7 0/0:2 0/0:8 0/0:6 0/0:9 0/0:6 0/0:2 0/0:4 0/0:4 0/0:7 0/0:6 0/0:4 0/0:5 0/0:5 0/0:4 0/0:6 0/0:3 0/0:4 0/0:5 0/0:6 0/0:6 0/0:6 0/0:9 0/0:4 0/0:4 0/0:8 0/0:8 0/0:5 0/0:8 0/0:3 0/0:2 0/0:6 0/0:8 0/0:4 0/0:5 0/0:10 0/0:12 0/0:3 0/0:6 0/0:6 0/0:3 0/0:7

    12 50233595 . T . 27.78 . AN=200;DP=563;MQ=59.33;MQ0=0 GT:DP 0/0:4 0/0:6 0/0:5 0/0:11 0/0:4 0/0:1 0/0:1 0/0:6 0/0:7 0/0:8 0/0:9 0/0:5 0/0:8 0/0:4 0/0:7 0/0:9 0/0:4 0/0:4 0/0:7 0/0:7 0/0:5 0/0:7 0/0:10 0/0:3 0/0:8 0/0:5 0/0:3 0/0:1 0/0:1 0/0:8 0/0:3 0/0:4 0/0:7 0/0:6 0/0:2 0/0:2 0/0:1 0/0:2 0/0:5 0/0:3 0/0:9 0/0:7 0/0:6 0/0:2 0/0:14 0/0:3 0/0:7 0/0:9 0/0:4 0/0:3 0/0:3 0/0:3 0/0:4 0/0:3 0/0:8 0/0:11 0/0:3 0/0:7 0/0:6 0/0:4 0/0:6 0/0:3 0/0:9 0/0:2 0/0:8 0/0:6 0/0:10 0/0:5 0/0:1 0/0:4 0/0:1 0/0:4 0/0:5 0/0:8 0/0:4 0/0:7 0/0:3 0/0:5 0/0:6 0/0:11 0/0:6 0/0:12 0/0:5 0/0:8 0/0:6 0/0:5 0/0:8 0/0:3 0/0:9 0/0:7 0/0:8 0/0:6 0/0:6 0/0:4 0/0:3 0/0:8 0/0:5 0/0:10 0/0:12 0/0:5

    12 50235340 . G . 29.23 . AN=240;DP=575;MQ=59.38;MQ0=0 GT:DP 0/0:3 0/0:1 0/0:5 0/0:3 0/0:4 0/0:1 0/0:6 0/0:3 0/0:3 0/0:7 0/0:9 0/0:5 0/0:5 0/0:3 0/0:5 0/0:2 0/0:4 0/0:6 0/0:7 0/0:9 0/0:4 0/0:4 0/0:5 0/0:2 0/0:4 0/0:4 0/0:1 0/0:6 0/0:2 0/0:6 0/0:4 0/0:2 0/0:2 0/0:5 0/0:5 0/0:5 0/0:2 0/0:3 0/0:10 0/0:1 0/0:1 0/0:4 0/0:3 0/0:5 0/0:3 0/0:1 0/0:6 0/0:6 0/0:10 0/0:8 0/0:4 0/0:5 0/0:1 0/0:3 0/0:4 0/0:6 0/0:6 0/0:2 0/0:6 0/0:6 0/0:4 0/0:11 0/0:1 0/0:4 0/0:5 0/0:3 0/0:3 0/0:10 0/0:3 0/0:3 0/0:7 0/0:5 0/0:6 0/0:8 0/0:4 0/0:9 0/0:6 0/0:3 0/0:10 0/0:2 0/0:9 0/0:11 0/0:8 0/0:6 0/0:4 0/0:2 0/0:2 0/0:6 0/0:10 0/0:2 0/0:9 0/0:6 0/0:4 0/0:6 0/0:4 0/0:8 0/0:1 0/0:6 0/0:2 0/0:7 0/0:2 0/0:7 0/0:4 0/0:3 0/0:4 0/0:5 0/0:3 0/0:7 0/0:4 0/0:3 0/0:6 0/0:2 0/0:3 0/0:7 0/0:5 0/0:11 0/0:6 0/0:10 0/0:1 0/0:4

    Any thoughts on what is causing the error message and how to avoid it? Thanks.

  • tommycarstensen Posts: 112 Member

    Please ignore my comment. I had only regenerated 3 of my 4 VCFs. Regenerating the 4th one fixed the problem. I'm not sure why the VCF file was malformed in the first place, and I can't see where it is malformed. I recently ran out of disk space, so it might be related to that. Sorry for posting.

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    No worries, just glad it's resolved.

    If you ran out of space maybe the file was truncated.

    Geraldine Van der Auwera, PhD

  • tommycarstensen Posts: 112 Member

    The thing that puzzled me was that I had the expected number of columns in all 4 lines and it was not the last line of the file in any of the 4 VCF files. However, problem resolved and it most probably originated at my end. Thank you.
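For anyone who wants to locate a damaged record before re-running a caller, here is a minimal sketch that reports every record whose tab-separated column count differs from that of the `#CHROM` header line. It assumes an uncompressed, plain-text VCF; the function name `find_bad_records` and the tiny demo file are hypothetical, and it checks only column counts, not the rest of the format.

```python
import os
import tempfile


def find_bad_records(path):
    """Return (line_number, n_columns) for each record whose tab-separated
    column count differs from the #CHROM header line."""
    expected = None
    bad = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line or line.startswith("##"):
                continue  # skip empty lines and meta-information lines
            n_columns = len(line.split("\t"))
            if line.startswith("#CHROM"):
                expected = n_columns  # header defines the expected width
            elif expected is None or n_columns != expected:
                bad.append((line_number, n_columns))
    return bad


# Demo on a made-up file: line 4 is a record that lost its leading columns.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "s1"])
good = "\t".join(["12", "100", ".", "A", ".", "30", ".", "DP=5", "GT:DP", "0/0:5"])
short = "\t".join([".", ".", "END=200", "GT:DP", "0/0:5"])
with tempfile.NamedTemporaryFile("w", suffix=".vcf", delete=False) as tmp:
    tmp.write("\n".join(["##fileformat=VCFv4.1", header, good, short, good]) + "\n")
    demo_path = tmp.name

demo_result = find_bad_records(demo_path)
os.remove(demo_path)
print(demo_result)
```

Because it walks the whole file, a check like this also catches damage in the middle of a file, not only a truncated last line.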

  • Bettina_Harr Posts: 22 Member
    edited May 5

    I am getting this error when I run GenotypeGVCFs on my gVCF file.

    MESSAGE: Line 1056430: there aren't enough columns for line .   .   END=11516762    GT:DP:GQ:MIN_DP:PL  0/0:23:53:23:0,54,624 (we expected 9 tokens, and saw 5 ), for input source: /home/mpg05/bharr/ILLUMINA/gVCF_files/JR15.recal.bam14.gVCF
    

    command:

    java -Xmx10g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -R /usr/users/bharr/ILLUMINA/Mus_musculus.GRCm38.74.dna.chromosome.fa -T GenotypeGVCFs --variant 14.recal.bam14.gVCF --variant 15B.recal.bam14.gVCF --variant 16B.recal.bam14.gVCF --variant 18B.recal.bam14.gVCF --variant B2C.recal.bam14.gVCF --variant C1.recal.bam14.gVCF --variant E1.recal.bam14.gVCF --variant F1B.recal.bam14.gVCF --variant AH15.recal.bam14.gVCF --variant AH23.recal.bam14.gVCF --variant JR11.recal.bam14.gVCF --variant JR15.recal.bam14.gVCF --variant JR2-F1C.recal.bam14.gVCF --variant JR5-F1C.recal.bam14.gVCF --variant JR7-F1C.recal.bam14.gVCF --variant JR8-F1A.recal.bam14.gVCF --variant TP121B.recal.bam14.gVCF --variant TP17-2.recal.bam14.gVCF --variant TP1.recal.bam14.gVCF --variant TP3-92.recal.bam14.gVCF --variant TP4a.recal.bam14.gVCF --variant TP51D.recal.bam14.gVCF --variant TP7-10F1A2.recal.bam14.gVCF --variant TP81B.recal.bam14.gVCF -o ALL_14.vcf
    
  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi @Bettina_Harr,

    This might be a parsing error due to the extension you're using; the program expects the extension to be .vcf, not .gvcf. If you want to indicate that the files are GVCF in the file name, a popular solution is to use .g.vcf.

    Geraldine Van der Auwera, PhD
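As a small illustration of the naming convention suggested above, the sketch below only computes a `.g.vcf` name from a name ending in `.gvcf` (in any letter case); it does not rename anything on disk. The helper name `suggest_g_vcf_name` is hypothetical, and whether the extension matters for a given tool version is not settled in this thread.

```python
def suggest_g_vcf_name(filename):
    """Map a name ending in .gvcf (any letter case) to the .g.vcf convention.

    Names with any other ending are returned unchanged.
    """
    old_suffix = ".gvcf"
    if filename.lower().endswith(old_suffix):
        return filename[: -len(old_suffix)] + ".g.vcf"
    return filename


print(suggest_g_vcf_name("JR15.recal.bam14.gVCF"))
print(suggest_g_vcf_name("ALL_14.vcf"))
```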

  • Bettina_Harr Posts: 22 Member

    Hi Geraldine, thanks for your response. The extension is definitely not the problem, as the data for about half of the other chromosomes work fine with that extension. I have problems with chromosomes 2, 4, 5, 7, 14 and 16, but all other mouse chromosomes work perfectly (I am doing all my analyses split by chromosome, as otherwise the software is too slow). I also checked the gVCF file at this line and it is indeed corrupted. The weird thing is that HaplotypeCaller did not give me an error message when this file was generated. I kept the output and error files, and they report that the run was successfully completed. The same problem happens with other files, but not at the exact same line; i.e., which line is corrupted appears to be random. Right now I have no idea what to do about this, as HaplotypeCaller does not give an error.

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hmm, the tools should refuse to work with anything other than .vcf. I guess we forgot to enforce that check somewhere; I'll have to look into that.

    This apparently random file corruption sounds like a platform/system issue; you might want to talk to your IT support people.

    Geraldine Van der Auwera, PhD
