# VQSR error: "The provided VCF file is malformed at..."

Posts: 4

I am seeing this error on a single human WGS sample:

The provided VCF file is malformed at approximately line number "x": there are 557 genotypes while the header requires that 1525 genotypes be present for all records

Interestingly, when I run VQSR as part of the same pipeline on the same sample several times in a row, "x" changes to a different line number each time. Could someone explain what the error message means?

Hi Gareth, have you tried validating your VCF (with vcftools)? And can you tell me if your VCF was produced directly by GATK or if it was modified in any way by other tools?

Geraldine Van der Auwera, PhD

• Posts: 4

These VCFs were produced in combination with Picard AddOrReplaceReadGroups and MarkDuplicates. While I try to get vcftools running, could you look at the warnings produced by GATK's ValidateVariants? Do you think these are the reason we are seeing the error?

WARN 16:38:37,627 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:176515816, 1 vs. 1 *****
WARN 16:38:37,642 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:177378571, 1 vs. 1 *****
WARN 16:38:37,723 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:179853352, 1 vs. 1 *****
WARN 16:38:37,772 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr6:910087, 1 vs. 1 *****

Might be a symptom, but not the cause... Can you tell me what version you're using and what are the successive command lines that were used in the pipeline?

Geraldine Van der Auwera, PhD

• Posts: 255 ✭✭✭

Could you by chance be using the file "1000G_omni2.5.b37.vcf" from the 1.5/b37 GATK resource bundle when running VariantRecalibrator? That file does contain 1525 samples, which suggests to me that your copy of it may be corrupted (hence the message that 1525 genotypes are required but only 557 were found).
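One way to test the corruption theory is to compare the number of genotype columns in each record against the number of samples in the `#CHROM` header line. The following is only a minimal sketch, not how GATK does its own check: the function name `first_genotype_mismatch` is made up for illustration, and it assumes a plain tab-delimited VCF whose first nine columns are the fixed fields (CHROM through FORMAT).

```python
from typing import Iterable, Optional, Tuple

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT come before the sample columns.
FIXED_COLUMNS = 9


def first_genotype_mismatch(lines: Iterable[str]) -> Optional[Tuple[int, int, int]]:
    """Return (line_number, found, expected) for the first record whose
    genotype count differs from the header's sample count, else None.

    Hypothetical helper for illustration; not part of GATK or vcftools.
    """
    expected = None
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        columns = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names are whatever follows the nine fixed columns.
            expected = max(len(columns) - FIXED_COLUMNS, 0)
            continue
        if expected is None:
            raise ValueError("record seen before the #CHROM header line")
        found = max(len(columns) - FIXED_COLUMNS, 0)
        if found != expected:
            return line_number, found, expected
    return None


# Tiny in-memory example: the header declares 3 samples, the last record has 1.
demo = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
]
print(first_genotype_mismatch(demo))  # (4, 1, 3)
```

On a real file you would pass an open file handle instead of `demo`. A truncated record, such as the last line of a partial download, would show up as a mismatch like the 557 vs. 1525 in the error above.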

• Posts: 4

As for the command lines, here they are:

java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T RealignerTargetCreator -I ${BAM} -o ${BAM}.intervals;

java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTK.jar -I ${BAM} -R ucsc.hg19.fasta -T IndelRealigner -targetIntervals ${BAM}.intervals -o ${BAM}.realigned.bam

java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -nt 30 -I $1 -o $1.SNP.vcf -R /home/Pegasus5/HG01140/ucsc.hg19.fasta -glm SNP -metrics $1.SNP.metrics

java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTKLite.jar -T VariantRecalibrator -input 1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf -R /home/Pegasus5/HG01140/ucsc.hg19.fasta -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${HAPMAP} -resource:omni,known=false,training=true,truth=false,prior=12.0 ${KGP} -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 ${DBSNP} -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -recalFile 1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.recal -rscriptFile 1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.plots.R -tranchesFile \$1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.tranches -nt 8

Where

HAPMAP="/home/Pegasus5/HG01140/hapmap_3.3.hg19.vcf";
DBSNP="/home/Pegasus5/HG01140/dbsnp_135.hg19.vcf.gz";
KGP="/home/Pegasus5/HG01140/1000G_omni2.5.hg19.vcf";
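Since the error message does not say which input VCF is malformed, it can help to count the samples each file declares in its header and see which one matches the number in the message. Below is a rough sketch of that check. The helper names `open_vcf` and `declared_sample_count` are hypothetical, and it assumes each file is either plain text or gzip-compressed with a `.gz` suffix.

```python
import gzip
from typing import IO

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT come before the sample columns.
FIXED_COLUMNS = 9


def open_vcf(path: str) -> IO[str]:
    """Open a plain or gzip-compressed VCF as text (decided by the .gz suffix)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def declared_sample_count(path: str) -> int:
    """Return how many samples the #CHROM header line of a VCF declares.

    Hypothetical helper for illustration; a sites-only VCF yields 0.
    """
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                columns = line.rstrip("\r\n").split("\t")
                return max(len(columns) - FIXED_COLUMNS, 0)
            if not line.startswith("#"):
                break  # reached the records without seeing a header line
    raise ValueError(f"no #CHROM header line found in {path}")
```

Calling `declared_sample_count` on the paths held in `HAPMAP`, `DBSNP` and `KGP` would show which resource declares the number of genotypes named in the error, and so which file to inspect first.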

• Posts: 4

To follow up here: a re-downloaded 1000G_omni2.5.hg19.sites.vcf did the trick! I guess the old one we had was either partial or out of date. Thanks again for all the help.