
CombineGVCFs : key isn't defined in the VCFHeader

Hi there,

I am trying to combine gVCFs using CombineGVCFs and get the error: "Key END found in VariantContext field INFO at chr1:10439 but this key isn't defined in the VCFHeader".

The gVCFs are generated using HaplotypeCaller GATK4:
gatk --java-options "-Xmx50G" HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I x.bam -O x.g.vcf.gz

The gVCF headers include "contig=<ID=chr1,length=248956422>", and the record at the reported position is "chr1 10439 . AC A 359.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.836;ClippingRankSum=0.000;DP=36;ExcessHet=3.0103;FS=2.063;MLEAC=1;MLEAF=0.500;MQ=37.15;MQRankSum=0.425;QD=12.85;ReadPosRankSum=-0.401;SOR=1.445 GT:AD:DP:GQ:PL 0/1:7,21:28:99:397,0,129"

My question is: which key is the error referring to (maybe contig=<ID=chr1,length=248956422>?), and how should it be defined in the header, given the above position?

Is there any obvious solution to this?

Thank you!

Best Answer


  • est Member

    So, I downloaded GATK 4 a week or so ago (v4.0.7.0), and when I try to merge the thousands of gVCFs that I generated with HaplotypeCaller, I get this error at loci scattered throughout them.

    The Caller is:
    java -Xmx6g -jar /data1/bin/gatk4.jar HaplotypeCaller \
    --output-mode EMIT_ALL_CONFIDENT_SITES \
    --ERC GVCF \
    -R /data1/public/ref/hg38/gatk_bundle/Homo_sapiens_assembly38.fasta \
    --genotyping-mode DISCOVERY \
    -A BaseQuality \
    -A MappingQuality \
    -G StandardAnnotation \
    --min-base-quality-score 20 \
    --dbsnp /data1/public/ref/hg38/gatk_bundle/dbsnp_138.hg38.vcf.gz \
    -I #{input_bam} \
    -O #{name}.g.vcf

    And the combine call (whittled down to the smallest that exhibits the error)
    java -Xmx6g -jar /data1/bin/gatk4.jar CombineGVCFs \
    -R /data1/public/ref/hg38/gatk_bundle/Homo_sapiens_assembly38.fasta \
    -L chr1 -L chr2 -L chr3 -L chr4 -L chr5 -L chr6 -L chr7 -L chr8 -L chr9 -L chr10 -L chr11 -L chr12 -L chr13 -L chr14 -L chr15 -L chr16 -L chr17 -L chr18 -L chr19 -L chr20 -L chr21 -L chr22 -L chrY -L chrX \
    -V /data1/stuff/intermediate/gvcf/drv-41826.g.vcf.gz \
    -V /data1/stuff/intermediate/gvcf/drv-250.g.vcf.gz \
    -O drv-0.g.vcf

    Final error from GATK
    java.lang.IllegalStateException: Key END found in VariantContext field INFO at chr1:15811 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

    The guilty line is:
    chr1 15811 . TCTG . 103.27 . AN=2;DP=26;MQ=27.29 GT:AD:DP:MBQ:MMQ 0/0:26:26:41

    chr1 15811 . TCTG . 208.27 . AN=2;DP=60;MQ=36.41 GT:AD:DP:MBQ:MMQ 0/0:60:60:41

    When I grepped that particular position, I see a handful of similar looking loci.

    I don't have END defined in the VCF header, but there are thousands of samples, so manually editing these files isn't terribly appealing if I can avoid it (I haven't done a test to verify that would actually fix the problem).
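Before editing anything by hand, it may help to find out which of the thousands of gVCFs actually lack an END declaration. The following is a minimal sketch, not a GATK tool: it reads only the `##` header lines of each file (plain or gzip/bgzip-compressed) and reports the files whose header does not declare the INFO key END. The function names are hypothetical. For reference, gVCFs written by GATK typically declare the key as `##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">`; whether adding that line resolves the CombineGVCFs error is an assumption that has not been verified in this thread.

```python
import gzip
import sys


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic bytes
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def declared_info_ids(path):
    """Return the set of INFO IDs declared in the header of one VCF."""
    ids = set()
    with open_vcf(path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta lines are over (#CHROM line or first record)
            if line.startswith("##INFO=<ID="):
                # "##INFO=<ID=END,Number=1,..." -> "END"
                rest = line[len("##INFO=<ID="):]
                ids.add(rest.split(",", 1)[0].rstrip(">\n"))
    return ids


def files_missing_key(paths, key="END"):
    """Return the paths whose header does not declare the given INFO key."""
    return [p for p in paths if key not in declared_info_ids(p)]


if __name__ == "__main__":
    # Print every gVCF given on the command line that lacks ##INFO=<ID=END,...>
    for path in files_missing_key(sys.argv[1:]):
        print(path)
```

Saved under a name of your choice (for example `check_end_header.py`, a hypothetical name), it could be run as `python check_end_header.py /data1/stuff/intermediate/gvcf/*.g.vcf.gz`. Only the header is read, so it should stay fast even on large files.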

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭


    Hmm. Can you test the very latest version on a small snippet (from HaplotypeCaller to GenotypeGVCFs)?


  • est Member

    I did download the latest version earlier this week (I didn't realize I've been working on this for a month or so). My tests are still ongoing, but so far, if I use the gatk wrapper script instead of calling the jar directly with java -jar, it seems to be working. I'm increasing the number of samples as they complete on a single machine; I've had too many resubmissions to want to submit everything to our cluster if it was going to have to be redone yet again.

    Is it possible that some of the additional flags that the wrapper script passes to the java call are protecting it from whatever is causing the problem? Just swapping in the newer GATK jar for the old one and running the scripts as they were (java -jar ....) gave the same problem as the version I was using last week.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    edited October 2018

    Hi @est,

    Sheila has moved on to greener pastures and I'm helping our new support specialist ramp up. Did you solve your issue? I see you are using some wrapper script. Unfortunately, we cannot help you on issues with external wrapper scripts.

    I think our gatk-workflows repository scripts might be of interest to you. We have tried-and-tested WDL pipeline scripts for GATK Best Practice workflows that handle parallelizing over genomic intervals. Cromwell is the tool that runs the WDL language scripts and can interpret the same pipeline locally, on a cluster, or in the cloud. If you don't want to set these up yourself, we also have preconfigured these same workflows to run on Google Cloud. Essentially, all you need is to provide your own data. The platform is explained here. You can sign up for free credits towards testing, e.g. debugging, here.
