Duplicate field error in GenomicsDBImport

foxyjohn Member
edited September 6 in Ask the GATK team


I've got a question about an error generated using GenomicsDBImport.

The gVCF I'm trying to import has three offending duplicate field names (BR, MQ & QD). Checking the header, I notice that each name is duplicated in both the INFO and FILTER fields (see below). I also notice other non-offending names (DP for example) that are duplicated in INFO and in FORMAT fields but which don't seem to cause any bother.

My question is: why are duplicate names allowed across the INFO & FORMAT fields but not across INFO & FILTER?
Secondly, and more importantly, is there some way (other than to rejig all my headers & data) to tell GenomicsDBImport that the duplicate names belong to different fields when it creates the vid attributes file, or maybe to switch off the check?

Thanks, Sean.

=== offending field ===
INFO=<ID=QD,Number=A,Type=Float,Description="Ratio of phred-scaled posterior probability (PP) to number of supporting reads for each allele (VC).">
FILTER=<ID=QD,Description="Quality over Depth: Indicates low quality relative to number of supporting reads (any of INFO::QD < 15 for Indels or INFO::QD < 15 otherwise).">

INFO=<ID=BR,Number=A,Type=Float,Description="The median of the per-read min base quality (within a interval of the locus) taken over reads supporting each allele.">
FILTER=<ID=BR,Description="Bad Reads: Indicates low quality base pairs on reads in the vicinity of variant locus (any of INFO::BR < 15).">

=== non-offending field ===
INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of read coverage at this locus.">
FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of reads overlapping the variant site (i.e. INFO::DP split out by sample). For reference calls the average depth (rounded to the nearest integer) over the region is reported.">
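
To see which IDs are shared between header classes before running an import, the header can be scanned with a short script. This is a minimal sketch, not part of GATK: the function names `shared_ids` and `filter_collisions` are made up for illustration, the descriptions in the sample header are shortened to "...", and it assumes the header lines look like the ones quoted above (with or without a leading `##`).

```python
import re
from collections import defaultdict

# Matches structured header lines such as INFO=<ID=QD,...>,
# with or without the leading "##" that VCF headers normally carry.
HEADER_RE = re.compile(r'^(?:##)?(INFO|FORMAT|FILTER)=<ID=([^,>]+)')


def shared_ids(header_lines):
    """Return {ID: set of classes} for IDs declared in more than one class."""
    classes = defaultdict(set)
    for line in header_lines:
        m = HEADER_RE.match(line.strip())
        if m:
            classes[m.group(2)].add(m.group(1))
    return {id_: cls for id_, cls in classes.items() if len(cls) > 1}


def filter_collisions(header_lines):
    """Return the sorted IDs that are a FILTER and also an INFO or FORMAT."""
    shared = shared_ids(header_lines)
    return sorted(id_ for id_, cls in shared.items() if "FILTER" in cls)


# Sample header based on the lines quoted in this post (descriptions shortened).
header = [
    'INFO=<ID=QD,Number=A,Type=Float,Description="...">',
    'FILTER=<ID=QD,Description="...">',
    'INFO=<ID=BR,Number=A,Type=Float,Description="...">',
    'FILTER=<ID=BR,Description="...">',
    'INFO=<ID=DP,Number=1,Type=Integer,Description="...">',
    'FORMAT=<ID=DP,Number=1,Type=Integer,Description="...">',
]

print(filter_collisions(header))  # ['BR', 'QD']
```

DP is also reported by `shared_ids`, but only as an INFO/FORMAT pair, which matches the observation that it does not trigger the error.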


  • Tiffany_at_Broad Cambridge, MA Member, Administrator, Broadie, Moderator admin

    Hi @foxyjohn, can you please provide the version of GATK used, the exact tool command, and the entire error log?

  • foxyjohn Member

    GATK version is

    Command:
    gatk_megs=$(head -n1 /proc/meminfo | awk '{print int(0.9*($2/1024))}');
    gatk --java-options "-Xmx${gatk_megs}m" GenomicsDBImport --genomicsdb-workspace-path pon_db -V GHS_PT100006_233694007.gvcf.gz -L xgen_plus_spikein.b38.bed --batch-size 50 --reader-threads 5 --tmp-dir=./tmp2

    Error log:
    Using GATK jar /mnt/PoN_gvcf/gatk-
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx14441m -jar /mnt/PoN_gvcf/gatk- GenomicsDBImport --genomicsdb-workspace-path pon_db -V GHS_PT100006_233694007.gvcf.gz -L xgen_plus_spikein.b38.bed --batch-size 50 --reader-threads 5 --tmp-dir=./tmp2
    16:20:56.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/PoN_gvcf/gatk-!/com/intel/gkl/native/libgkl_compression.so
    16:20:57.244 INFO GenomicsDBImport - ------------------------------------------------------------
    16:20:57.244 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.2.0
    16:20:57.245 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    16:20:57.245 INFO GenomicsDBImport - Executing as [email protected] on Linux v4.4.0-1090-aws amd64
    16:20:57.245 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10
    16:20:57.245 INFO GenomicsDBImport - Start Date/Time: September 9, 2019 4:20:56 PM UTC
    16:20:57.249 INFO GenomicsDBImport - ------------------------------------------------------------
    16:20:57.249 INFO GenomicsDBImport - ------------------------------------------------------------
    16:20:57.252 INFO GenomicsDBImport - HTSJDK Version: 2.19.0
    16:20:57.252 INFO GenomicsDBImport - Picard Version: 2.19.0
    16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    16:20:57.252 INFO GenomicsDBImport - Deflater: IntelDeflater
    16:20:57.253 INFO GenomicsDBImport - Inflater: IntelInflater
    16:20:57.253 INFO GenomicsDBImport - GCS max retries/reopens: 20
    16:20:57.253 INFO GenomicsDBImport - Requester pays: disabled
    16:20:57.253 INFO GenomicsDBImport - Initializing engine
    16:20:57.921 INFO FeatureManager - Using codec BEDCodec to read file file:///mnt/PoN_gvcf/xgen_plus_spikein.b38.bed
    16:20:58.514 INFO IntervalArgumentCollection - Processing 38997831 bp from intervals
    16:20:58.591 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
    16:20:58.594 INFO GenomicsDBImport - Done initializing engine
    16:20:58.829 INFO GenomicsDBImport - Vid Map JSON file will be written to /mnt/PoN_gvcf/pon_db/vidmap.json
    16:20:58.829 INFO GenomicsDBImport - Callset Map JSON file will be written to /mnt/PoN_gvcf/pon_db/callset.json
    16:20:58.829 INFO GenomicsDBImport - Complete VCF Header will be written to /mnt/PoN_gvcf/pon_db/vcfheader.vcf
    16:20:58.829 INFO GenomicsDBImport - Importing to array - /mnt/PoN_gvcf/pon_db/genomicsdb_array
    16:20:58.830 WARN GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization.
    16:20:58.830 INFO ProgressMeter - Starting traversal
    16:20:58.830 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
    16:26:23.008 INFO GenomicsDBImport - Importing batch 1 with 1 samples
    Duplicate field name BR found in vid attribute "fields"
    Duplicate field name MQ found in vid attribute "fields"
    Duplicate field name QD found in vid attribute "fields"
    terminate called after throwing an instance of 'FileBasedVidMapperException'
    what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"

  • Tiffany_at_Broad Cambridge, MA Member, Administrator, Broadie, Moderator admin

    Hi @foxyjohn, I think the comments in this closed ticket explain what is happening. I am going to raise a ticket for the dev team to look for another way around this, "other than to rejig all my headers & data," as you said.
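
In the meantime, the header-and-data "rejig" could at least be scripted. The sketch below is an illustration only and rests on an assumption this thread does not confirm: that giving the colliding FILTER IDs a different name is enough to satisfy the duplicate-name check. The function name `rename_filters` and the `_filter` suffix are hypothetical.

```python
import re

# Captures the prefix and the ID of a FILTER header line, with or without "##".
FILTER_HEADER_RE = re.compile(r'^((?:##)?FILTER=<ID=)([^,>]+)')


def rename_filters(lines, colliding, suffix="_filter"):
    """Yield VCF lines with the colliding FILTER IDs renamed.

    The rename is applied to FILTER header declarations and to the FILTER
    column (column 7) of data lines. INFO and FORMAT are left untouched.
    """
    colliding = set(colliding)
    for line in lines:
        line = line.rstrip("\n")
        m = FILTER_HEADER_RE.match(line)
        if m and m.group(2) in colliding:
            # FILTER=<ID=QD,...> becomes FILTER=<ID=QD_filter,...>
            yield m.group(1) + m.group(2) + suffix + line[m.end():]
        elif line and not line.startswith("#") and line.count("\t") >= 6:
            cols = line.split("\t")
            # Column 7 holds semicolon-separated filter names, PASS, or "."
            cols[6] = ";".join(
                name + suffix if name in colliding else name
                for name in cols[6].split(";")
            )
            yield "\t".join(cols)
        else:
            yield line


# Made-up example lines, not taken from the gVCF discussed in this thread.
example = [
    '##FILTER=<ID=QD,Description="x">',
    '##INFO=<ID=QD,Number=A,Type=Float,Description="x">',
    'chr1\t100\t.\tA\tG\t50\tQD;BR\tQD=3.1',
]
for out in rename_filters(example, {"BR", "MQ", "QD"}):
    print(out)
```

For a compressed gVCF the lines would have to be read through `gzip.open(path, "rt")`, and the rewritten file would need to be recompressed and reindexed (for example with bgzip and tabix) before being passed to GenomicsDBImport.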
