GATK4 MergeVcfs "One or more header lines must be in the header line collection"

FPBarthel, Houston, Member ✭✭
edited July 2018 in Ask the GATK team

Hi! I am trying to use MergeVcfs to merge several VCF files (VarScan2 output files) but I am getting the following error:

gatk MergeVcfs \
   -I A.vcf \
   -I B.vcf \
   -D human_g1k_v37_decoy.dict \
   -O out.vcf

java.lang.IllegalArgumentException: One or more header lines must be in the header line collection

Unfortunately I cannot find any information about this error message. I have tried using gatk ValidateVariants to validate the input VCF files but this does not return any errors:

gatk ValidateVariants \
   -V A.vcf \
   -R human_g1k_v37_decoy.fasta

12:01:11.764 INFO  ValidateVariants - Done initializing engine
12:01:11.764 INFO  ProgressMeter - Starting traversal
12:01:11.765 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
12:01:12.641 INFO  ProgressMeter -           1:29562369              0.0                 43393        2978924.5
12:01:12.642 INFO  ProgressMeter - Traversal complete. Processed 43393 total variants in 0.0 minutes.
12:01:12.642 INFO  ValidateVariants - Shutting down engine
[July 1, 2018 12:01:12 PM EDT] done. Elapsed time: 0.03 minutes.

Can anyone familiar with the code point me in the right direction?

The VCF header for A.vcf and B.vcf looks as follows:

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">

Best Answer


  • Sheila, Broad Institute, Member, Broadie, Moderator, admin


    It looks like this error occurs when you have duplicate contigs in your VCF index file or VCF header. Can you check if this is the case?
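
One way to run the check suggested above is to list any contig IDs that appear more than once in a VCF header. This is a minimal sketch that assumes uncompressed VCF files; the function name `duplicate_contigs` is made up for illustration and is not part of GATK.

```shell
# Print every contig ID that occurs more than once in the header of a
# plain-text VCF. Prints nothing if there are no duplicates (or no
# ##contig lines at all).
duplicate_contigs() {
  grep '^##contig=' "$1" \
    | sed 's/.*<ID=\([^,>]*\).*/\1/' \
    | sort \
    | uniq -d
}
```

For example, `duplicate_contigs A.vcf` prints one contig ID per line for each duplicated entry, and empty output means the header has no repeated contigs.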


  • FPBarthel, Houston, Member ✭✭

    Hi @Sheila! I'm not sure that is the problem, since the input VCF files don't have an index and their headers don't include any contig lines (I passed the sequence dictionary file to MergeVcfs, as in the command line above). The header of the VCF files I'm trying to merge is exactly as shown above.

  • FPBarthel, Houston, Member ✭✭

    Thanks @shlee, using UpdateVCFSequenceDictionary to re-introduce the contig lines in both input VCF files resolved the issue. However, this suggests that the documentation for MergeVcfs is inaccurate. It states:

    Optionally a sequence dictionary file (typically name ending in .dict) if the input VCF does not contain a complete contig list and if the output index is to be created (true by default).

    It appears that if the input VCF does not contain a complete contig list, supplying a sequence dictionary to MergeVcfs as suggested does not work; MergeVcfs requires a complete contig list to be present in the inputs from the start.
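
The workaround described in this post can be sketched roughly as follows: add contig lines to each input with UpdateVCFSequenceDictionary, then merge the updated files. The function names, the `DRY_RUN` switch, and the `.contigs.vcf` output naming are assumptions made for illustration, and the exact options may differ between GATK versions, so check `gatk UpdateVCFSequenceDictionary --help` before relying on it.

```shell
# Sequence dictionary from the original post; override by exporting DICT.
DICT="${DICT:-human_g1k_v37_decoy.dict}"

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: add_contigs_and_merge OUT.vcf IN1.vcf [IN2.vcf ...]
# The body runs in a subshell so its variables do not leak to the caller.
add_contigs_and_merge() (
  set -eu
  out="$1"; shift
  n=$#                                   # number of input VCFs
  for vcf in "$@"; do
    fixed="${vcf%.vcf}.contigs.vcf"      # assumed naming scheme
    # Write a copy of the input whose header carries the contig lines.
    run gatk UpdateVCFSequenceDictionary \
      -V "$vcf" --source-dictionary "$DICT" -O "$fixed"
    set -- "$@" -I "$fixed"              # queue the fixed file for merging
  done
  shift "$n"                             # drop the original input names
  run gatk MergeVcfs "$@" -O "$out"
)
```

Calling `DRY_RUN=1 add_contigs_and_merge out.vcf A.vcf B.vcf` prints the three gatk commands without running them, which is a cheap way to review them first.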

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin


    Thanks for letting us know. To confirm before I change the doc: is the sequence dictionary file required if the output index is to be created? If so, do I just need to remove the "if the input VCF does not contain a complete contig list" bit?


  • FPBarthel, Houston, Member ✭✭
    edited July 2018

    It does not look like the sequence dictionary is required. The -D parameter accepts a sequence dictionary, but the parameter does not seem to serve any purpose.

    1. If any of the input files lacks a sequence dictionary in its VCF header, the program returns an error, regardless of whether -D is provided.
    2. If all input files contain a sequence dictionary in their VCF headers, the VCFs are merged as intended, regardless of whether -D is provided, and the sequence dictionary is retained in the merged output VCF.
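
Given the two observations above, a quick pre-flight check before calling MergeVcfs is to confirm that every input header contains `##contig` lines. This is a small sketch for uncompressed VCF files; the function names `has_contig_lines` and `check_inputs` are hypothetical and not part of GATK.

```shell
# Succeed if the header of a plain-text VCF has at least one ##contig line.
# Only the header is scanned: reading stops at the first line that does not
# start with "##".
has_contig_lines() {
  awk '/^##contig=/ { found = 1; exit }
       !/^##/       { exit }
       END          { exit (found ? 0 : 1) }' "$1"
}

# Report every input that lacks contig lines; fail if any were found.
# The body runs in a subshell so its variables do not leak to the caller.
check_inputs() (
  status=0
  for vcf in "$@"; do
    if ! has_contig_lines "$vcf"; then
      echo "missing ##contig lines: $vcf" >&2
      status=1
    fi
  done
  exit "$status"
)
```

For example, `check_inputs A.vcf B.vcf && gatk MergeVcfs -I A.vcf -I B.vcf -O out.vcf` only attempts the merge when both headers carry contig lines.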

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin


    Thanks. I will put in a doc fix.


  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭

    I've put in a request for MergeVcfs to use a provided dictionary for building the header and sorting. If you are reading this and think this type of feature would be useful, please say so in the GitHub issue ticket.
