[ERROR] Malformed VCF: empty alleles are not permitted in VCF records

mattqdean, CA, Member
edited February 2016 in Ask the GATK team

I am running BaseRecalibrator for my RNA-seq:

java -jar -Xmx120g ${GATK} -T BaseRecalibrator -R "${reference}" -I "${file4}" -knownSites "${gerVar}" -knownSites "${somVar}" -o "${file4%_tstaids.bam}_tstaidsr.table1"
java -jar -Xmx120g ${GATK} -T BaseRecalibrator -R "${reference}" -I "${file4}" -knownSites "${gerVar}" -knownSites "${somVar}" -BQSR "${file4%_tstaids.bam}_tstaidsr.table1" -o "${file4%_tstaids.bam}_tstaidsr.table2"
java -jar -Xmx120g ${GATK} -T AnalyzeCovariates -R "${reference}" -before "${file4%_tstaids.bam}_tstaidsr.table1" -after "${file4%_tstaids.bam}_tstaidsr.table2" -plots "${file1%_tsta.bam}_BQSR.pdf"
java -jar -Xmx120g ${GATK} -T PrintReads -R "${reference}" -I "${file4}" -BQSR "${file4%_tstaids.bam}_tstaidsr.table1" -o "${file7}"

Note that I got two variant VCFs from Ensembl (germline and somatic). My reference is Ensembl GRCh38.p5. I ran the command below to add the 'chr' prefix and change chrMT to chrM:

sed -e '/^[^#]/s/^/chr/' -e 's/^chrMT/chrM/'

I received this error:

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 18354680: empty alleles are not permitted in VCF records

I used the command below to inspect my VCF file (it is ${gerVar} that is malformed):

sed -n '18354680p'

which returned:

chr11 5249456 HbVar.633 G . . PhenCode_20140430;TSA=sequence_alteration;AA=A
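
Rather than checking one reported line at a time, every record with an empty ALT column can be listed in a single pass. The following is only a minimal sketch, assuming an uncompressed, tab-delimited VCF; `find_empty_alt` is a hypothetical helper name and not part of GATK.

```shell
# Hypothetical helper: print "line_number<TAB>CHROM:POS" for every data record
# whose ALT column (field 5) is empty. Header lines (starting with '#') are skipped.
find_empty_alt() {
    awk 'BEGIN { FS = "\t" } !/^#/ && NF >= 5 && $5 == "" { print NR "\t" $1 ":" $2 }' "$1"
}

# Usage (the file name is an assumption):
# find_empty_alt germline.vcf
```

Setting FS to a single tab matters here: with awk's default whitespace splitting, an empty column would be silently collapsed and the following columns would shift left.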

Answers

  • girardot, Heidelberg, Germany, Member

    Hi all,

    I have the same issue, but it is a bit weird. I am using the dbSNP VCF for the fly (from yesterday) and the latest GATK (3.7). They provide a vcd.gz together with a .tbi index. If I use the cvs.gz together with its .tbi (co-located), it works just fine, but if I remove the .tbi, I get the error. I can already hear you thinking "why the hell does he remove the .tbi?" Well, this is because I initially tried to run GATK in Galaxy with the uncompressed file (without an associated index) and got the same error.
    Do you have any idea why the presence of the index makes GATK consider the VCF valid?
    thx

    Charles

  • girardot, Heidelberg, Germany, Member

    Sorry for the bad autocorrection: vcd.gz and cvs.gz should read vcf.gz.

  • Sheila, Broad Institute, Member, Broadie, Moderator

    @girardot
    Hi,

    All GATK tools absolutely require a VCF index along with the VCF. GATK will index uncompressed VCF files on-the-fly, but it will not index compressed VCF files. I am surprised you got an error running on an uncompressed VCF without an index. Perhaps it is a Galaxy issue?

    -Sheila

  • splaisan, Leuven / Gent (Belgium), Member

    In case this helps anyone: I had the same error with the fruit fly dbSNP data from NCBI, ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/fruitfly_7227/VCF/00-All.vcf.gz

    vcf-validator reports:

    The column ALT is empty at 3R:19738372.
    The column ALT is empty at 3R:24346609.
    The column ALT is empty at 3R:28099302.
    The column ALT is empty at X:2782218.
    The column ALT is empty at X:10101813.
    The column ALT is empty at X:12757524.
    The column ALT is empty at X:18195320.

    I was able to correct the VCF with the following awk snippet:

    gawk 'BEGIN{FS="\t"; OFS="\t"}{if (NF>1 && $5=="") {$5="."; print $0} else print $0}' fruitfly_7227.vcf > fruitfly_7227_corr.vcf
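
The gawk one-liner above can also be wrapped so that it never touches header lines and reports how many records it changed. This is a rough sketch under the same assumptions (uncompressed, tab-delimited input); `fix_empty_alt` is a hypothetical name, and whether `.` is a suitable ALT value for a known-sites file is not something this thread establishes beyond it making the file parse.

```shell
# Hypothetical wrapper around the awk fix from the post above.
# Reads a tab-delimited VCF ($1) and writes a corrected copy ($2) in which every
# data record with an empty ALT column (field 5) gets "." instead.
# The number of corrected records is printed to stderr.
fix_empty_alt() {
    awk 'BEGIN { FS = OFS = "\t" }
         # Only data records (no leading "#") with at least five columns are edited.
         !/^#/ && NF >= 5 && $5 == "" { $5 = "."; fixed++ }
         { print }
         END { printf "%d record(s) corrected\n", fixed > "/dev/stderr" }' "$1" > "$2"
}

# Usage (file names taken from the post above):
# fix_empty_alt fruitfly_7227.vcf fruitfly_7227_corr.vcf
```

Since the corrected file is uncompressed, GATK should be able to index it on the fly, as described earlier in the thread.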
