VariantRecalibrator malformed header error

Hi, can anyone please help me with this error message from VariantRecalibrator?

MESSAGE: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

My command line is:
java -jar ~/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg38.fa -input Sample_aln_filtered_sorted_nodup_rgHapGVCFJoint.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf -an DP -an QD -an FS
-an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9
-tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R

My input file comes directly from HaplotypeCaller run with the GVCF option; all samples were then joined using GenotypeGVCFs.

I have seen this error posted before but I have not been able to solve the issue using any of the given replies and answers.

Any help would be greatly appreciated.

Comments

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Can you please post the full log output? Please also post the header of your input VCF (you can remove the @SQ lines for brevity).

  • Anastasis (Member)

    Hi, and thanks for your reply.

    This is the full error log:
    INFO 08:42:38,095 HelpFormatter - Executing as myhost on Linux 2.6.32-504.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13.
    INFO 08:42:38,096 HelpFormatter - Date/Time: 2016/10/04 08:42:38
    INFO 08:42:38,096 HelpFormatter - --------------------------------------------------------------------------------
    INFO 08:42:38,096 HelpFormatter - --------------------------------------------------------------------------------
    INFO 08:42:38,381 GenomeAnalysisEngine - Strictness is SILENT
    INFO 08:42:38,659 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
    ERROR ------------------------------------------------------------------------------------------

    The header has too many lines to paste here, but if you have any suggestions about what might be wrong with its formatting, please let me know.

    Many thanks

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    I'd like to see the entire output log, not just the error log, please.
  • Anastasis (Member)

    Hello and thanks again!

    Here is the rest of the output log:

    INFO 15:51:29,901 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:51:29,903 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
    INFO 15:51:29,903 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 15:51:29,904 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
    INFO 15:51:29,904 HelpFormatter - [Tue Oct 04 15:51:29 EEST 2016] Executing on Linux 2.6.32-504.el6.x86_64 amd64
    INFO 15:51:29,904 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13 JdkDeflater
    INFO 15:51:29,908 HelpFormatter - Program Args: -T VariantRecalibrator -R hg38.fa -input ND_All_aln_filtered_sorted_nodup_rgHapGVCFJoint.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 All_20160527.vcf -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
    INFO 15:51:29,929 HelpFormatter - Executing as myhost on Linux 2.6.32-504.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13.
    INFO 15:51:29,930 HelpFormatter - Date/Time: 2016/10/04 15:51:29
    INFO 15:51:29,930 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:51:29,930 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:51:30,274 GenomeAnalysisEngine - Strictness is SILENT
    INFO 15:51:30,613 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    Thanks -- I wanted to check that the interpreted command line looked correct, which it does.

    Nothing obvious stands out from this. Try running ValidateVariants on your input VCF and see if that produces the same error. That will tell us whether VariantRecalibrator is misbehaving or your file is actually malformed, without having to look through it manually.

    If the file is malformed, you'll need to check the output log of GenotypeGVCFs to see if anything went wrong there. You should also check whether the file was copied or moved between being written and your current run. A file system glitch or similar problem could have corrupted it. If that's the case, redo the GenotypeGVCFs step to regenerate the file.
  • Anastasis (Member)

    Thank you, I will look into it and get back to you.

  • Anastasis (Member)

    I managed to solve the problem. The formatting issue was not in the -input VCF but in one of the -resource VCFs, so the error message was somewhat misleading. It would be more informative if the error log included the file name of the malformed VCF.

    Many thanks again for replies.
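
Since the error message does not name the offending file, one way to narrow it down is to check every VCF passed to the tool (the -input file and each -resource file) for the required `#CHROM` header line. The following is a minimal sketch in Python, not part of GATK; the function names `has_chrom_header` and `find_vcfs_missing_header` are made up for illustration, and it assumes plain-text or gzip-compressed VCFs. It only looks for the header line and does not validate anything else about the file.

```python
import gzip
from pathlib import Path


def has_chrom_header(vcf_path):
    """Return True if a '#CHROM' header line appears before the first non-header line.

    Hypothetical helper for illustration; it is not a GATK function.
    """
    path = Path(vcf_path)
    # Assumption: files ending in .gz are gzip/bgzip compressed, others are plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # '##...' meta lines do not match, because they start with '##', not '#CHROM'.
            if line.startswith("#CHROM"):
                return True
            if not line.startswith("#"):
                # Reached a non-header line without seeing the '#CHROM' line.
                return False
    # Empty file, or only '##' meta lines.
    return False


def find_vcfs_missing_header(vcf_paths):
    """Return the paths (as strings) of the files that lack a '#CHROM' header line."""
    return [str(p) for p in vcf_paths if not has_chrom_header(p)]
```

Calling `find_vcfs_missing_header` with the list of VCF paths from the command line returns the files that would be worth inspecting first.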

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    Ah, thanks for reporting this. VariantRecalibrator has some of the more opaque error messages in GATK... we'll try to fix that up.