VQSR error: "The provided VCF file is malformed at..."

I am seeing this error on a single human WGS sample:

The provided VCF file is malformed at approximately line number "x": there are 557 genotypes while the header requires that 1525 genotypes be present for all records

Interestingly, when I run VQSR as part of the same pipeline on the same sample several times in a row, "x" changes to a different line number each time. Could someone explain what the error message means?

Best Answer


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Gareth, have you tried validating your VCF (with vcftools)? And can you tell me if your VCF was produced directly by GATK or if it was modified in any way by other tools?

  • These VCFs were produced in combination with Picard AddOrReplaceReadGroups and MarkDuplicates. While I try to get vcftools running, could you look at the warnings produced by GATK's ValidateVariants? Do you think they could be the reason we are seeing the error?

    WARN 16:38:37,627 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:176515816, 1 vs. 1 *****
    WARN 16:38:37,642 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:177378571, 1 vs. 1 *****
    WARN 16:38:37,723 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr5:179853352, 1 vs. 1 *****
    WARN 16:38:37,772 ValidateVariants - ***** the Allele Count (AC) tag is incorrect for the record at position chr6:910087, 1 vs. 1 *****

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Might be a symptom, but not the cause... Can you tell me what version you're using and what are the successive command lines that were used in the pipeline?

  • Could you by chance be using the file "1000G_omni2.5.b37.vcf" from the 1.5/b37 GATK resource bundle when running VariantRecalibrator? That file does contain 1525 samples, which suggests to me that your copy of it may be corrupted (hence the message that 1525 genotypes are required but only 557 were found).

  • Kurt, I think you might be right about this, although I am using the 1000G_omni2.5.hg19.vcf. I'm going to try re-downloading and running again.

    As for the command lines, here they are:

    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T RealignerTargetCreator -I ${BAM} -o ${BAM}.intervals;
    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTK.jar -I ${BAM} -R ucsc.hg19.fasta -T IndelRealigner -targetIntervals ${BAM}.intervals -o ${BAM}.realigned.bam

    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -nt 30 -I $1 -o $1.SNP.vcf -R /home/Pegasus5/HG01140/ucsc.hg19.fasta -glm SNP -metrics $1.SNP.metrics

    java -Xmx12g -jar ${TOOLS}GATK/GenomeAnalysisTKLite.jar -T VariantRecalibrator \
        -input $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf \
        -R /home/Pegasus5/HG01140/ucsc.hg19.fasta \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${HAPMAP} \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 ${KGP} \
        -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 ${DBSNP} \
        -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
        -recalFile $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.recal \
        -rscriptFile $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.plots.R \
        -tranchesFile $1.novoalign.merged.sorted.rg.dedup.bam.bam.realigned.bam.SNP.vcf.tranches \
        -nt 8



  • To follow up here: a re-downloaded 1000G_omni2.5.hg19.sites.vcf did the trick! I guess the old copy we had was either partial or out of date. Thanks again for all the help.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Thanks for confirming the problem was solved.
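
For anyone who lands here with the same message: the error compares the number of sample columns declared on the #CHROM header line with the number of genotype columns found in a record. Below is a minimal Python sketch of that kind of check, which may help locate a truncated or corrupted resource file before re-downloading it. It is not a GATK tool and not how GATK itself parses VCFs; the function names (open_vcf, find_genotype_mismatches) are made up for illustration, and it assumes a tab-delimited VCF that is either plain text or gzip-compressed.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip-compressed VCF in text mode (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_genotype_mismatches(lines, max_reports=10):
    """Compare each record's column count with the #CHROM header line.

    Returns (expected_samples, problems), where problems is a list of
    (line_number, found_samples) for records whose column count differs
    from the header. Scanning stops after max_reports problems.
    """
    header_columns = None
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Skip blank lines and "##" meta-information lines.
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            header_columns = len(fields)
            continue
        if header_columns is None:
            raise ValueError("record found before the #CHROM header line")
        if len(fields) != header_columns:
            # Columns 1-8 are fixed and column 9 is FORMAT;
            # anything after that is one genotype per sample.
            problems.append((line_number, max(len(fields) - 9, 0)))
            if len(problems) >= max_reports:
                break
    expected_samples = 0 if header_columns is None else max(header_columns - 9, 0)
    return expected_samples, problems
```

To scan a file, something like `with open_vcf(path) as handle: expected, problems = find_genotype_mismatches(handle)` should do. An empty `problems` list only means the column counts agree with the header; it does not prove the file is complete, so comparing the file size or checksum against the bundle copy is still worthwhile.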
