Problem with a vcf file


I am working with the rat Rn5 .vcf file from Ensembl, and I get an error message about the format of the .vcf file when I try to run the GATK RealignerTargetCreator.

"MESSAGE: The provided VCF file is malformed at approximately line number 14: Unparsable vcf record with allele G."

Any help regarding this issue will be welcome,
Thanks in advance

Best Answer


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi there,

    Can you please post the line that the program is complaining about? Without seeing it I cannot say what the problem might be.

  • ghlopez (Member)

    Hi there,

    Sorry, here it goes in a more extended manner:

    INFO 11:24:38,579 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:24:38,583 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
    INFO 11:24:38,583 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 11:24:38,583 HelpFormatter - For support and documentation go to
    INFO 11:24:38,588 HelpFormatter - Program Args: -T RealignerTargetCreator -fixMisencodedQuals -R /Volumes/Data2/genomes/Rn5/Rattus_norvegicus/Ensembl/Rnor_5.0/Sequence/WholeGenomeFasta/genome.fa -I /Volumes/Data/AnalysesTemp/H3_rn5_dedup.bam -known:name,VCF /Volumes/Data2/genomes/Rn5/Rattus_norvegicus/Ensembl/Rnor_5.0/Annotation/Variation/Rattus_norvegicus_indels.vcf -o /Volumes/Data2/RAT-NGS/Analyses/targetIntervalsH3_1v1.list
    INFO 11:24:38,588 HelpFormatter - Date/Time: 2013/12/02 11:24:38
    INFO 11:24:38,588 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:24:38,588 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:24:39,197 GenomeAnalysisEngine - Strictness is SILENT
    INFO 11:24:39,302 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 11:24:39,310 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 11:24:39,337 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
    INFO 11:24:39,358 RMDTrackBuilder - Creating Tribble index in memory for file /Volumes/Data2/genomes/Rn5/Rattus_norvegicus/Ensembl/Rnor_5.0/Annotation/Variation/Rattus_norvegicus_indels.vcf
    INFO 11:24:41,624 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR MESSAGE: The provided VCF file is malformed at approximately line number 14: Unparsable vcf record with allele G.
    ERROR ------------------------------------------------------------------------------------------


    Thanks for your help

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Thanks, but I meant the line in the VCF file that the GATK is choking on. Actually if you can post the first 20 lines or so of the VCF that would be great.
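
To pull out the record the error points at, something like the following Python sketch could help. It prints nothing by itself; it only returns the lines around a given 1-based line number so they can be pasted into a reply. The function name `lines_around` and the `context` parameter are made up for this illustration, and since the GATK message says "approximately line number 14", a few lines either side are included.

```python
from itertools import islice


def lines_around(path, line_number, context=3):
    """Return (number, text) pairs for the lines near a 1-based line number.

    Hypothetical helper: 'context' is how many lines to include on each
    side of the reported line.
    """
    start = max(1, line_number - context)
    stop = line_number + context
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # islice uses 0-based positions, so line 'start' is index start - 1.
        window = islice(handle, start - 1, stop)
        return [(start + i, text.rstrip("\n")) for i, text in enumerate(window)]
```

For example, `lines_around("Rattus_norvegicus_indels.vcf", 14)` would return lines 11 to 17 of that file, each paired with its line number.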

  • ghlopez (Member)
    edited December 2013

    Oops, sorry, here it goes:

    ##INFO=<ID=TSA,Number=0,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project.">
    ##INFO=<ID=dbSNP_138,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP">
    1   3564603 rs106167861 GGA G.  .   .   dbSNP_138;TSA=sequence_alteration
    1   5805668 rs105092325 GAAAAACACACACACACACACACACATATATATATATATATTTGTCTGGTTGGTT G.  .   .   dbSNP_138;TSA=sequence_alteration
    1   9216715 rs105872294 GAA G.  .   .   dbSNP_138;TSA=sequence_alteration
    1   9218305 rs107435552 AGC A.  .   .   dbSNP_138;TSA=sequence_alteration
    1   9482686 rs106054993 GGG G.  .   .   dbSNP_138;TSA=sequence_alteration


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Oh I see, the ALT allele is "G." -- that dot following the letter is not allowed by the VCF specification as far as I know (someone please correct me if it is allowed; I haven't seen this before). Every variant seems to have this extra dot in the ALT field. Did you get this file straight from Ensembl?
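
To check how many records are affected, a short Python sketch along these lines could scan the file for ALT alleles written as bases with a leading or trailing dot. It assumes a tab-delimited VCF with ALT in the fifth column; the names `DOTTED_ALT` and `find_dotted_alts` are hypothetical and not part of GATK or any VCF library.

```python
import re

# Bases with one trailing or one leading dot, e.g. "G." or ".G".
DOTTED_ALT = re.compile(r"^(?:[ACGTNacgtn]+\.|\.[ACGTNacgtn]+)$")


def find_dotted_alts(lines):
    """Yield (line_number, variant_id, allele) for each dotted ALT allele."""
    for number, line in enumerate(lines, start=1):
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete record; leave it for a real validator
        for allele in fields[4].split(","):
            if DOTTED_ALT.match(allele):
                yield number, fields[2], allele
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly. A plain "." in ALT (the missing value) is deliberately not reported.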

  • ghlopez (Member)


    I guessed the same the first time I got the error. This is not the original Ensembl file (I simply removed the SNVs from the original one). I double-checked, and in the original Ensembl file the SNVs do not have the extra dot, whereas the indels do have it and look like the ones I sent you.

    Should I simply try to get rid of those dots?

    Thanks for everything,

  • pdexheimer (Member ✭✭✭✭)

    that dot rings a bell - is it the same as the "single ended breakpoints" discussed in another thread?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    I do believe you're right, @pdexheimer. Dot's on the other side compared to last time, but it's basically the same issue.

    @ghlopez, this is a way of writing ALT alleles that we currently do not support. I'm not sure if just removing the dots is the right thing to do (because I'm not familiar enough with this usage), so I would recommend asking the providers of the data (either Ensembl support or the researchers who submitted the data to them; I don't know what the relative responsibilities are there).
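
For reference, this is roughly what "just removing the dots" would look like as a Python sketch. Whether the result is a faithful representation of these variants is exactly the open question in this thread, so treat it as a workaround to discuss with the data providers, not as a recommended fix. The function name `strip_alt_dots` is hypothetical, and the sketch assumes tab-delimited records with ALT in the fifth column.

```python
def strip_alt_dots(line):
    """Return a VCF line with leading/trailing dots removed from ALT alleles.

    Assumption: this changes what the record says (for example "GGA -> G."
    becomes a plain deletion "GGA -> G"), which may not be what the data
    providers intended.
    """
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return line  # not a complete record; leave untouched
    alleles = []
    for allele in fields[4].split(","):
        stripped = allele.strip(".")
        # A lone "." is the VCF missing value, so keep it as it is.
        alleles.append(stripped if stripped else allele)
    fields[4] = ",".join(alleles)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Writing the output to a new file and keeping the original untouched makes it easy to compare the two or to revert once Ensembl has clarified the intended meaning.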

  • ghlopez (Member)

    Thank you both, for your useful comments.
    I will check it with Ensembl support and see what they suggest.

  • anja (Member)

    I generate the Ensembl VCF files and just had a look at this. The problem is that for the variant rs105872294 we don't have any information on the ALT allele: we only have the ID (D1RAT249) and a location, 1:9216716-9216717. To represent this in VCF I use the character for the missing value, which is a dot. But if this is not recognized by the parser, I was wondering if it would be better to use an angle-bracketed ID string: REF: GAA ALT: G<(D1RAT249)>?
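
To make the different ALT notations mentioned in this thread easier to tell apart, here is a rough Python sketch that sorts a single ALT allele into a few categories. The category names and the function `classify_alt` are invented for this illustration and reflect one reading of the VCF 4.1 text, not what the GATK parser actually does; breakend notation with brackets is simply lumped into "other".

```python
import re


def classify_alt(allele):
    """Roughly classify one ALT allele string (hypothetical categories)."""
    if allele == ".":
        return "missing"      # the VCF missing value on its own
    if re.fullmatch(r"[ACGTNacgtn]+", allele):
        return "bases"        # an ordinary sequence allele such as "G"
    if re.fullmatch(r"<[^<>]+>", allele):
        return "symbolic"     # an angle-bracketed ID string such as "<DEL>"
    if re.fullmatch(r"[ACGTNacgtn]+\.|\.[ACGTNacgtn]+", allele):
        return "dotted"       # bases with a dot attached, such as "G."
    return "other"            # anything else, including "G<(D1RAT249)>"
```

Under this sketch both the current "G." form and the proposed "G<(D1RAT249)>" form fall outside the plain "bases" and "symbolic" categories, so it would be worth confirming with the parser maintainers which form, if any, they can accept before regenerating the files.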

  • ghlopez (Member)

    Hi again,

    Thank you both for the comments. I think I may follow Geraldine's suggestion and try to replace those unknown alleles with a "." and keep going with the analysis.
