VQSR error: "The provided VCF file is malformed at... "

gareth862, Member (Posts: 4)

I am seeing this error on a single human WGS sample:

The provided VCF file is malformed at approximately line number "x": there are 557 genotypes while the header requires that 1525 genotypes be present for all records

Interestingly, when I run VQSR on the same sample several times in a row, as part of the same pipeline, "x" is a different line number each time. Could someone explain what this error message means?

Answers

  • Geraldine_VdAuwera, Administrator, GATK Developer (Posts: 6,461)

    Hi Gareth, have you tried validating your VCF (with vcftools)? And can you tell me if your VCF was produced directly by GATK or if it was modified in any way by other tools?

    Geraldine Van der Auwera, PhD

  • gareth862, Member (Posts: 4)

    These VCFs were produced in combination with Picard AddOrReplaceReadGroups and MarkDuplicates. While I try to get vcftools running, could you look at the warnings produced by GATK's ValidateVariants? Could these be the reason we are seeing the error?

    WARN 16:38:37,627 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:176515816, 1 vs. 1 *****
    WARN 16:38:37,642 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:177378571, 1 vs. 1 *****
    WARN 16:38:37,723 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:179853352, 1 vs. 1 *****
    WARN 16:38:37,772 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr6:910087, 1 vs. 1 *****

  • Geraldine_VdAuwera, Administrator, GATK Developer (Posts: 6,461)

    Might be a symptom, but not the cause... Can you tell me what version you're using and what are the successive command lines that were used in the pipeline?

    Geraldine Van der Auwera, PhD

  • Kurt, Member (Posts: 164)

    Could you by chance be using the file "1000G_omni2.5.b37.vcf" from the 1.5/b37 GATK resource bundle when running VariantRecalibrator? That file does contain 1525 samples, which suggests that your copy of it may be corrupted. That would explain why the error says the header requires 1525 genotypes but only 557 were found.

  • gareth862, Member (Posts: 4)

    Kurt, I think you might be right about this, although I am using the 1000G_omni2.5.hg19.vcf. I'm going to try re-downloading and running again.

    As for the command lines, here they are:

    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T RealignerTargetCreator -I ${BAM} -o ${BAM}.intervals

    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTK.jar -I ${BAM} -R ucsc.hg19.fasta -T IndelRealigner -targetIntervals ${BAM}.intervals -o ${BAM}.realigned.bam

    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -nt 30 -I $1 -o $1.SNP.vcf -R /home/Pegasus5/HG01140/ucsc.hg19.fasta -glm SNP -metrics $1.SNP.metrics

    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTKLite.jar -T VariantRecalibrator \
        -input $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf \
        -R /home/Pegasus5/HG01140/ucsc.hg19.fasta \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${HAPMAP} \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 ${KGP} \
        -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 ${DBSNP} \
        -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
        -recalFile $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.recal \
        -rscriptFile $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.plots.R \
        -tranchesFile $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.tranches \
        -nt 8

    Where

    HAPMAP="/home/Pegasus5/HG01140/hapmap_3.3.hg19.vcf"
    DBSNP="/home/Pegasus5/HG01140/dbsnp_135.hg19.vcf.gz"
    KGP="/home/Pegasus5/HG01140/1000G_omni2.5.hg19.vcf"

  • gareth862, Member (Posts: 4)

    To follow up: a re-downloaded 1000G_omni2.5.hg19.sites.vcf did the trick! I guess the old copy we had was either partial or out of date. Thanks again for all the help.

  • Geraldine_VdAuwera, Administrator, GATK Developer (Posts: 6,461)

    Thanks for confirming the problem was solved.

    Geraldine Van der Auwera, PhD
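
For readers hitting the same message, the diagnosis in this thread (a resource VCF whose records no longer match its header) can be checked before rerunning VariantRecalibrator. The sketch below is one possible approach, not a GATK or vcftools feature: the function names `vcf_sample_count` and `vcf_bad_records` are hypothetical, and it assumes uncompressed, tab-delimited VCFs whose `#CHROM` header line has nine fixed columns (through FORMAT) followed by one column per sample. The first helper reports how many genotypes the header declares, which helps identify which input file the "1525 genotypes" in the error refers to; the second lists data lines whose column count disagrees with the header, as a partial or corrupted download might produce.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of GATK or vcftools.

# Print the number of sample (genotype) columns declared by the #CHROM line.
# Assumes 9 fixed columns (CHROM..FORMAT) precede the sample columns;
# a sites-only VCF (8 columns) is reported as 0.
vcf_sample_count() {
    awk -F'\t' '/^#CHROM/ { n = NF - 9; print (n > 0 ? n : 0); exit }' "$1"
}

# Print every data line whose column count differs from the #CHROM line.
# No output means every record has as many columns as the header.
vcf_bad_records() {
    awk -F'\t' '
        /^#CHROM/ { expected = NF; next }   # remember the header width
        /^#/      { next }                  # skip other header lines
        NF != expected {
            printf "line %d: %d columns, header declares %d\n", NR, NF, expected
        }
    ' "$1"
}
```

With the variables from the thread, these could be called as `vcf_sample_count "$KGP"` and `vcf_bad_records "$KGP"`, and likewise for `$HAPMAP`. A gzipped file such as the one in `$DBSNP` would need to be decompressed first. This only checks column counts, so it complements a full validator rather than replacing one.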
