GATK GenomicsDBImport error : Duplicate fields exist in vid attribute "fields" and 2 errors

Wonton, Macau, Member
Hello GATK team!
I am using Mutect2, FilterMutectCalls, and GenomicsDBImport for somatic calling. The Mutect2 and FilterMutectCalls steps produce the samples' gVCFs without problems. However, when I use GenomicsDBImport to combine all the gVCFs, the step fails with several errors, and I am running out of ideas. Thank you.
1. Mutect2, .bam to .g.vcf: seems OK and generates .vcf, .vcf.idx, and .vcf.stats
gatk Mutect2 --reference .../hg19.fa --input ....bam --output ...g.vcf -ERC GVCF --tmp-dir ...
2. FilterMutectCalls, .g.vcf to .g.vcf: seems OK and generates .vcf, .vcf.idx, and .vcf.filteringStats.tsv
gatk FilterMutectCalls --reference .../hg19.fa --variant ...g.vcf --intervals ...hg19.bed --output ...g.vcf --tmp-dir ...
3. Combining the gVCFs with GenomicsDBImport:
gatk GenomicsDBImport --reference .../hg19.fa --sample-name-map ${sample_mapFile} --validate-sample-name-map true --intervals ...hg19.bed --genomicsdb-workspace-path ... --max-num-intervals-to-import-in-parallel 20 --consolidate true --batch-size 100 --merge-input-intervals true --tmp-dir ...
This command is the same one that works in my germline pipeline. The VCF files can also be read now, but I got this error:
Duplicate field name TLOD found in vid attribute "fields"
Duplicate field name TLOD found in vid attribute "fields"
terminate called after throwing an instance of 'FileBasedVidMapperException'
terminate called recursively
what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"
4. I deleted this header line and re-ran:
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log odds ratio score for variant">
Then I got this error:
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 171: . is not a valid start position in the VCF format, for input source: file:///home/yb87626/breast/variantCalling/SRR8437498.postM2.g.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(
5. I deleted this line and re-ran.
Then I got this error:
[July 3, 2019 7:10:52 AM UTC] done. Elapsed time: 0.06 minutes.
htsjdk.tribble.TribbleException: Line 169: there aren't enough columns for line END=17447;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:1:1:-4.765e-01 (we expected 9 tokens, and saw 3 ), for input source: file:///home/yb87626/breast/variantCalling/SRR8437498.postM2.g.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(
at htsjdk.variant.vcf.AbstractVCFCodec.decode(
You can see that the program treats 'END=17447;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:1:1:-4.765e-01' as a whole line, even though there are columns before it on the same line. The problem may not be this particular line, because the same error occurs at the next line when I delete it.
Now I don't know how to solve this. Also, did I do things correctly before steps 4 and 5?
Part of .g.vcf:
chr1 1 . N <NON_REF> . PASS END=17405;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:0:0:0.00
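
The two symptoms described above (a field name reported as a duplicate, and records that appear to have too few columns) can be checked directly against the VCF text, without going through the index. Below is a minimal Python sketch, not a GATK tool. The function name `scan_vcf` is made up for illustration, the sketch assumes an uncompressed, tab-delimited VCF, and the `##FORMAT` line in the demo is hypothetical (the thread only shows the `##INFO` declaration of TLOD). Whether an ID declared under both INFO and FORMAT is what GenomicsDBImport objects to is an assumption; the thread does not confirm the cause.

```python
import re
from collections import defaultdict

# Matches structured header lines such as ##INFO=<ID=TLOD,...>
HEADER_ID = re.compile(r"^##(\w+)=<ID=([^,>]+)")


def scan_vcf(lines):
    """Scan VCF text lines and return (shared_ids, short_records).

    shared_ids: IDs declared under both INFO and FORMAT, mapped to the
        sorted list of sections that declare them.
    short_records: (line_number, column_count) for data lines that have
        fewer tab-separated columns than the #CHROM header line.
    """
    declared = defaultdict(set)
    expected = None
    short_records = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("##"):
            match = HEADER_ID.match(line)
            if match and match.group(1) in ("INFO", "FORMAT"):
                declared[match.group(2)].add(match.group(1))
        elif line.startswith("#CHROM"):
            expected = len(line.split("\t"))
        elif line:
            count = len(line.split("\t"))
            if expected is not None and count < expected:
                short_records.append((number, count))
    shared_ids = {k: sorted(v) for k, v in declared.items() if len(v) > 1}
    return shared_ids, short_records


demo = [
    '##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log odds ratio score for variant">',
    # Hypothetical FORMAT declaration, included only to illustrate the check.
    '##FORMAT=<ID=TLOD,Number=A,Type=Float,Description="hypothetical">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t1\t.\tN\t<NON_REF>\t.\tPASS\tEND=17405;STRANDQ=93\tGT:DP:MIN_DP:TLOD\t0/0:0:0:0.00",
    # The fragment the parser reported in the error message, used as a short record.
    "END=17447;STRANDQ=93\tGT:DP:MIN_DP:TLOD\t0/0:1:1:-4.765e-01",
]
shared, short = scan_vcf(demo)
print("IDs declared in more than one section:", shared)
print("Records with too few columns:", short)
```

If a scan like this reports no short records in the real file, the text itself has the expected number of columns, and the "not enough columns" message would then point at how the file is being read rather than at its content.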


  • bhanuGandham, Cambridge MA, Member, Administrator, Broadie, Moderator admin
    edited July 11

    Hi @Wonton

    Please try and run ValidateVariants on your gvcf files to determine if there is an issue with the vcf format.

  • Wonton, Macau, Member
    Hi bhanuGandham,
    Thank you for your help. No file or report was generated after I ran ValidateVariants on the gVCF file. Does that mean the gVCF file is OK?
    Here is the command:
    gatk ValidateVariants --reference .../hg19.fa --variant .../SRR8437498.postM2.raw.g.vcf --tmp-dir ... 1> .../report.txt 2>.../gatk.err
    Here is the "gatk.err":
    Using GATK jar /opt/conda/share/gatk4-
    java -Dsamjdk.use_async_io_read_samtools...
    16:51:13.393 INFO NativeLibraryLoader - Loading from jar:file:/opt/conda/share/gatk4-!/com/intel/gkl/native/
    Jul 11, 2019 4:51:15 PM runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    16:51:15.656 INFO ValidateVariants - ------------------------------------------------------------
    16:51:15.656 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.1.0
    16:51:15.656 INFO ValidateVariants - For support and documentation go to
    16:51:15.657 INFO ValidateVariants - Executing as [email protected] on Linux v3.10.0-693.5.2.el7.x86_64 amd64
    16:51:15.658 INFO ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
    16:51:15.658 INFO ValidateVariants - Start Date/Time: July 11, 2019 4:51:13 PM UTC
    16:51:15.658 INFO ValidateVariants - ------------------------------------------------------------
    16:51:15.658 INFO ValidateVariants - ------------------------------------------------------------
    16:51:15.658 INFO ValidateVariants - HTSJDK Version: 2.19.0
    16:51:15.658 INFO ValidateVariants - Picard Version: 2.19.0
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    16:51:15.658 INFO ValidateVariants - Deflater: IntelDeflater
    16:51:15.658 INFO ValidateVariants - Inflater: IntelInflater
    16:51:15.658 INFO ValidateVariants - GCS max retries/reopens: 20
    16:51:15.658 INFO ValidateVariants - Requester pays: disabled
    16:51:15.658 INFO ValidateVariants - Initializing engine
    16:51:15.857 INFO FeatureManager - Using codec VCFCodec to read file file:///home/yb87626/validate/SRR8437498.postM2.raw.g.vcf
    16:51:15.888 INFO ValidateVariants - Done initializing engine
    16:51:15.888 INFO ProgressMeter - Starting traversal
    16:51:15.888 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    16:51:22.030 INFO ProgressMeter - chrUn_gl000219:99683 0.1 953167 9311302.5
    16:51:22.030 INFO ProgressMeter - Traversal complete. Processed 953167 total variants in 0.1 minutes.
    16:51:22.030 INFO ValidateVariants - Shutting down engine
    [July 11, 2019 4:51:22 PM UTC] done. Elapsed time: 0.14 minutes.
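
One way to read a log like the one above is to look for WARN or ERROR lines, exception messages, and the final traversal summary. The following Python sketch does only that. The function name `summarize_log` is made up for illustration, and the sketch assumes that a failed validation would show up in stderr as a WARN/ERROR line or an exception, which the thread does not state explicitly. A clean result says nothing about whether the file is suitable input for GenomicsDBImport.

```python
import re

# GATK log lines in this thread look like "16:51:15.656 INFO ValidateVariants - ..."
LEVEL = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3}\s+(WARN|ERROR)\s")
DONE = re.compile(r"Traversal complete\. Processed (\d+) total variants")


def summarize_log(lines):
    """Return (processed, problems) for an iterable of log lines.

    processed: variant count from the "Traversal complete" line, or None.
    problems: lines logged at WARN/ERROR level or mentioning an exception.
    """
    problems = []
    processed = None
    for raw in lines:
        line = raw.strip()
        if LEVEL.match(line) or "Exception" in line:
            problems.append(line)
        found = DONE.search(line)
        if found:
            processed = int(found.group(1))
    return processed, problems


log = [
    "16:51:15.888 INFO ProgressMeter - Starting traversal",
    "16:51:22.030 INFO ProgressMeter - Traversal complete. Processed 953167 total variants in 0.1 minutes.",
    "16:51:22.030 INFO ValidateVariants - Shutting down engine",
]
print(summarize_log(log))
```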
  • bhanuGandham, Cambridge MA, Member, Administrator, Broadie, Moderator admin

    Hi @Wonton

    We recommend and support using gVCFs generated from Mutect2 only for Mitochondrial DNA.

  • Wonton, Macau, Member
    Hi bhanuGandham,
    I understand. Thank you for your reply.
  • gauthier, Member, Broadie, Moderator, Dev admin

    Hi @Wonton ,

    After you deleted the problematic header line did you reindex the VCF? I think if you run gatk IndexFeatureFile -F <vcfFilePath> and try again the parsing error will go away. We don't support manual editing of VCFs, but I've done it myself and I believe I've resolved the same problem by reindexing.
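
A quick way to spot the situation described in this reply (a VCF edited by hand while its old index is left in place) is to compare file modification times. This is a heuristic Python sketch, not part of GATK. The function name `index_is_stale` is made up for illustration, and the `.idx` suffix is taken from the `.vcf.idx` files mentioned in the original post. A newer index does not prove the index is correct; it only flags the obvious case where the VCF changed after the index was written.

```python
import os


def index_is_stale(vcf_path, index_suffix=".idx"):
    """Return True if the index is missing or older than its VCF."""
    index_path = vcf_path + index_suffix
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(index_path) < os.path.getmtime(vcf_path)


# Example use (the path is hypothetical):
# if index_is_stale("sample.g.vcf"):
#     print("Reindex before importing, e.g. with gatk IndexFeatureFile")
```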


  • Wonton, Macau, Member
    Hi Laura,
    I understand your suggestion. I don't like manually editing them either. Thank you very much for your reply.