[ERROR] Malformed VCF: empty alleles are not permitted in VCF records

mattqdean CA Member
edited February 2016 in Ask the GATK team

I am running BaseRecalibrator for my RNA-seq:

java -jar -Xmx120g ${GATK} -T BaseRecalibrator -R "${reference}" -I "${file4}" -knownSites "${gerVar}" -knownSites "${somVar}" -o "${file4%_tstaids.bam}_tstaidsr.table1"
java -jar -Xmx120g ${GATK} -T BaseRecalibrator -R "${reference}" -I "${file4}" -knownSites "${gerVar}" -knownSites "${somVar}" -BQSR "${file4%_tstaids.bam}_tstaidsr.table1" -o "${file4%_tstaids.bam}_tstaidsr.table2"
java -jar -Xmx120g ${GATK} -T AnalyzeCovariates -R "${reference}" -before "${file4%_tstaids.bam}_tstaidsr.table1" -after "${file4%_tstaids.bam}_tstaidsr.table2" -plots "${file1%_tsta.bam}_BQSR.pdf"
java -jar -Xmx120g ${GATK} -T PrintReads -R "${reference}" -I "${file4}" -BQSR "${file4%_tstaids.bam}_tstaidsr.table1" -o "${file7}"

Note that I downloaded two variant VCFs from Ensembl (germline and somatic). My reference is Ensembl GRCh38.p5. I ran the command below to add the 'chr' prefix to the contig names and change chrMT to chrM:

sed -e '/^[^#]/s/^/chr/' -e 's/^chrMT/chrM/'

I received this error:

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 18354680: empty alleles are not permitted in VCF records

I used the command below to inspect my VCF file (it is ${gerVar} that is malformed):

sed -n '18354680p'

which returned:

chr11 5249456 HbVar.633 G . . PhenCode_20140430;TSA=sequence_alteration;AA=A
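GATK reports only the first offending line, so it can help to list every record whose ALT column (the fifth tab-separated field) is empty before rerunning. A minimal sketch, assuming a tab-delimited, uncompressed VCF on standard input; the function name `list_empty_alt` and the sample records are made up for this example:

```shell
# Hypothetical helper: print "line_number<TAB>CHROM:POS" for each record
# whose ALT column (field 5) is empty. Header lines starting with '#' are skipped.
list_empty_alt() {
  awk -F'\t' '!/^#/ && NF >= 5 && $5 == "" { print NR "\t" $1 ":" $2 }'
}

# Made-up sample: the record on line 3 has an empty ALT field.
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr11\t100\trs1\tG\t\t.\t.\tAA=A\nchr11\t200\trs2\tA\tT\t.\t.\tAA=A\n' | list_empty_alt
```

On a real file this could be run as `list_empty_alt < "${gerVar}"`, or as `gzip -dc file.vcf.gz | list_empty_alt` for a compressed VCF (file name hypothetical).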

Answers

  • girardot Heidelberg, Germany Member

    Hi all,

    I have the same issue, but it is a bit weird. I am using the dbSNP VCF for the fly (downloaded yesterday) and the latest GATK (3.7). They provide a vcd.gz together with a .tbi index. If I use the cvs.gz together with its .tbi (in the same directory), it works just fine, but if I remove the .tbi, I get the error. I can already hear you thinking "why the hell does he remove the .tbi?" Well, this is because I initially tried to run GATK in Galaxy with the uncompressed file (without an associated index) and got the same error.
    Do you have any idea why the presence of the index makes GATK consider the VCF valid?
    Thanks

    Charles

  • girardot Heidelberg, Germany Member

    Sorry for the bad autocorrection: vcd.gz and cvs.gz should read vcf.gz.

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @girardot
    Hi,

    All GATK tools absolutely require a VCF index along with the VCF. GATK will index uncompressed VCF files on-the-fly, but it will not index compressed VCF files. I am surprised you got an error running on an uncompressed VCF without an index. Perhaps it is a Galaxy issue?

    -Sheila

  • splaisan Leuven (Belgium) Member ✭✭

    In case this helps someone: I had the same error with the fruit fly dbSNP data from NCBI, ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/fruitfly_7227/VCF/00-All.vcf.gz

    vcf-validator reports:

    The column ALT is empty at 3R:19738372.
    The column ALT is empty at 3R:24346609.
    The column ALT is empty at 3R:28099302.
    The column ALT is empty at X:2782218.
    The column ALT is empty at X:10101813.
    The column ALT is empty at X:12757524.
    The column ALT is empty at X:18195320.

    I was able to correct the VCF with the following awk snippet, which puts "." in the ALT column wherever it is empty:

    gawk 'BEGIN{FS="\t"; OFS="\t"}{if (NF>1 && $5=="") {$5="."; print $0} else print $0}' fruitfly_7227.vcf > fruitfly_7227_corr.vcf
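The same correction can also be written as a stdin-to-stdout filter, so it can sit in a pipe after decompression instead of needing an unpacked copy on disk. This is only a sketch under assumptions: it uses plain `awk` rather than `gawk`, the function name `fix_empty_alt` is invented here, the ID `rs1` in the sample is made up, and it only touches non-header records whose fifth column is empty:

```shell
# Hypothetical filter: replace an empty ALT column (field 5) with "."
# in data records; header lines and all other records pass through unchanged.
fix_empty_alt() {
  awk 'BEGIN { FS = OFS = "\t" }
       !/^#/ && NF >= 5 && $5 == "" { $5 = "." }
       { print }'
}

# Made-up record at one of the positions reported by vcf-validator above.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n3R\t19738372\trs1\tA\t\t.\t.\t.\n' | fix_empty_alt
```

For the compressed download, something like `gzip -dc 00-All.vcf.gz | fix_empty_alt > fruitfly_7227_corr.vcf` should give the same result as the gawk command. Running vcf-validator again on the corrected file is a reasonable check that no empty ALT columns remain.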
