GATK 4.1.1.0 GenomicsDBImport error : Duplicate fields exist in vid attribute "fields" and 2 errors

Wonton Macau Member
Hello GATK team!
I am currently using Mutect2, FilterMutectCalls and GenomicsDBImport for somatic calling. The Mutect2 and FilterMutectCalls steps produced each sample's gVCF without problems. However, when I try to combine all the gVCFs with GenomicsDBImport, the step fails with several errors, and I am running out of ideas. Thank you.
GATK4.1.1.0
1. Mutect2, .bam to .g.vcf: seems OK and generates .vcf, .vcf.idx and .vcf.stats
gatk Mutect2 --reference .../hg19.fa --input ....bam --output ...g.vcf -ERC GVCF --tmp-dir ...
2. FilterMutectCalls, .g.vcf to .g.vcf: seems OK and generates .vcf, .vcf.idx and .vcf.filteringStats.tsv
gatk FilterMutectCalls --reference .../hg19.fa --variant ...g.vcf --intervals ...hg19.bed --output ...g.vcf --tmp-dir ...
3. Combining the gVCFs with GenomicsDBImport:
gatk GenomicsDBImport --reference .../hg19.fa --sample-name-map ${sample_mapFile} --validate-sample-name-map true --intervals ...hg19.bed --genomicsdb-workspace-path ... --max-num-intervals-to-import-in-parallel 20 --consolidate true --batch-size 100 --merge-input-intervals true --tmp-dir ...
The same command works fine in my germline pipeline, and the VCF files can be read. But I got this error:
Duplicate field name TLOD found in vid attribute "fields"
Duplicate field name TLOD found in vid attribute "fields"
terminate called after throwing an instance of 'FileBasedVidMapperException'
terminate called recursively
what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"
4. I deleted this header line and re-ran:
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log odds ratio score for variant">
Then I got this error:
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 171: . is not a valid start position in the VCF format, for input source: file:///home/yb87626/breast/variantCalling/SRR8437498.postM2.g.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:797)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:324)
...
5. I deleted this header line and re-ran:
##tumor_sample=SAMN10735600
Then I got this error:
[July 3, 2019 7:10:52 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.06 minutes.
Runtime.totalMemory()=1243086848
htsjdk.tribble.TribbleException: Line 169: there aren't enough columns for line END=17447;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:1:1:-4.765e-01 (we expected 9 tokens, and saw 3 ), for input source: file:///home/yb87626/breast/variantCalling/SRR8437498.postM2.g.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:296)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:277)
...
As you can see, the program treats 'END=17447;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:1:1:-4.765e-01' as a complete line, but in the file there are more columns before it on the same line. The problem may not be specific to this line, because the same error occurs at the next line when I delete this one.
I don't know how to solve this. Also, were steps 4 and 5 the right thing to do?
Part of .g.vcf:
##fileformat=VCFv4.2
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMN10735589
chr1 1 . N <NON_REF> . PASS END=17405;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:0:0:0.00
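
One way to investigate the "Duplicate field name TLOD" message is to check whether the header declares the same ID in more than one section. The sketch below is not part of GATK or GenomicsDB; the function names are hypothetical, and the FORMAT declaration in the example header is an assumption (the post only shows the INFO line for TLOD and a FORMAT column that uses TLOD). It also assumes an uncompressed, plain-text VCF.

```python
import re
from collections import defaultdict

# Matches structured header lines such as ##INFO=<ID=TLOD,...> or ##FORMAT=<ID=TLOD,...>
HEADER_ID = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def read_header_lines(path):
    """Collect the '##' meta-information lines from an uncompressed VCF."""
    lines = []
    with open(path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            lines.append(line.rstrip("\n"))
    return lines


def ids_declared_in_multiple_sections(header_lines):
    """Return {ID: sorted header kinds} for IDs declared under more than one kind."""
    kinds_by_id = defaultdict(set)
    for line in header_lines:
        match = HEADER_ID.match(line)
        if match:
            kind, field_id = match.groups()
            kinds_by_id[field_id].add(kind)
    return {fid: sorted(kinds) for fid, kinds in kinds_by_id.items() if len(kinds) > 1}


# Hypothetical header excerpt. The INFO line for TLOD is the one quoted in step 4;
# the FORMAT line and the STRANDQ line are assumptions made for illustration.
example_header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=TLOD,Number=A,Type=Float,Description="(assumed FORMAT declaration)">',
    '##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log odds ratio score for variant">',
    '##INFO=<ID=STRANDQ,Number=1,Type=Integer,Description="(assumed)">',
]

print(ids_declared_in_multiple_sections(example_header))
```

On a real file, `ids_declared_in_multiple_sections(read_header_lines(path))` would list any ID shared between sections. This only reports the overlap; whether that overlap is what GenomicsDBImport is rejecting is not confirmed in this thread.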

Answers

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin
    edited July 2019

    Hi @Wonton

    Please try running ValidateVariants on your gVCF files to determine whether there is an issue with the VCF format.

  • Wonton Macau Member
    Hi bhanuGandham,
    Thank you for your help. No file or report was generated after I ran ValidateVariants on the gVCF file. Does that mean the gVCF file is OK?
    Here is the command:
    gatk ValidateVariants --reference .../hg19.fa --variant .../SRR8437498.postM2.raw.g.vcf --tmp-dir ... 1> .../report.txt 2>.../gatk.err
    Here is the "gatk.err":
    Using GATK jar /opt/conda/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools...
    16:51:13.393 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.1.1.0-0/gatk-package-4.1.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jul 11, 2019 4:51:15 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    16:51:15.656 INFO ValidateVariants - ------------------------------------------------------------
    16:51:15.656 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.1.0
    16:51:15.656 INFO ValidateVariants - For support and documentation go to
    ...
    16:51:15.657 INFO ValidateVariants - Executing as [email protected] on Linux v3.10.0-693.5.2.el7.x86_64 amd64
    16:51:15.658 INFO ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
    16:51:15.658 INFO ValidateVariants - Start Date/Time: July 11, 2019 4:51:13 PM UTC
    16:51:15.658 INFO ValidateVariants - ------------------------------------------------------------
    16:51:15.658 INFO ValidateVariants - ------------------------------------------------------------
    16:51:15.658 INFO ValidateVariants - HTSJDK Version: 2.19.0
    16:51:15.658 INFO ValidateVariants - Picard Version: 2.19.0
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    16:51:15.658 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    16:51:15.658 INFO ValidateVariants - Deflater: IntelDeflater
    16:51:15.658 INFO ValidateVariants - Inflater: IntelInflater
    16:51:15.658 INFO ValidateVariants - GCS max retries/reopens: 20
    16:51:15.658 INFO ValidateVariants - Requester pays: disabled
    16:51:15.658 INFO ValidateVariants - Initializing engine
    16:51:15.857 INFO FeatureManager - Using codec VCFCodec to read file file:///home/yb87626/validate/SRR8437498.postM2.raw.g.vcf
    16:51:15.888 INFO ValidateVariants - Done initializing engine
    16:51:15.888 INFO ProgressMeter - Starting traversal
    16:51:15.888 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    16:51:22.030 INFO ProgressMeter - chrUn_gl000219:99683 0.1 953167 9311302.5
    16:51:22.030 INFO ProgressMeter - Traversal complete. Processed 953167 total variants in 0.1 minutes.
    16:51:22.030 INFO ValidateVariants - Shutting down engine
    [July 11, 2019 4:51:22 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.14 minutes.
    Runtime.totalMemory()=1869086720
  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    Hi @Wonton

    We recommend and support using gVCFs generated from Mutect2 only for Mitochondrial DNA.

  • Wonton Macau Member
    Hi bhanuGandham,
    I understand. Thank you for your reply.
  • gauthier Member, Broadie, Dev ✭✭✭

    Hi @Wonton ,

    After you deleted the problematic header line did you reindex the VCF? I think if you run gatk IndexFeatureFile -F <vcfFilePath> and try again the parsing error will go away. We don't support manual editing of VCFs, but I've done it myself and I believe I've resolved the same problem by reindexing.

    -Laura

  • Wonton Macau Member
    Hi Laura,
    I understand your suggestion. I don't like manually editing them either. Thank you very much for your reply.
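
The reindexing suggestion fits the errors in steps 4 and 5, where the parser appears to start reading partway through a record. The sketch below is a toy model of why that can happen: it is not Tribble's actual index format, and `line_offsets` and `read_line_at` are hypothetical helpers. It assumes only that an index stores byte offsets into the file, so deleting a header line shifts every record while the old offsets stay the same.

```python
def line_offsets(data: bytes):
    """Toy 'index': byte offset at which each non-header line starts."""
    offsets = []
    position = 0
    for line in data.splitlines(keepends=True):
        if not line.startswith(b"#"):
            offsets.append(position)
        position += len(line)
    return offsets


def read_line_at(data: bytes, offset: int) -> bytes:
    """Read from a byte offset to the end of that line, as an index-driven reader might."""
    end = data.find(b"\n", offset)
    return data[offset:] if end == -1 else data[offset:end]


# Miniature VCF built from the lines quoted in the question (tab-separated).
original = (
    b"##fileformat=VCFv4.2\n"
    b"##tumor_sample=SAMN10735600\n"
    b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMN10735589\n"
    b"chr1\t1\t.\tN\t<NON_REF>\t.\tPASS\tEND=17405;STRANDQ=93\tGT:DP:MIN_DP:TLOD\t0/0:0:0:0.00\n"
)

# Index built before the manual edit.
stale_index = line_offsets(original)

# Manual edit: delete one header line, as in step 5.
edited = original.replace(b"##tumor_sample=SAMN10735600\n", b"")

# Index rebuilt after the edit (the toy equivalent of reindexing).
fresh_index = line_offsets(edited)

stale_read = read_line_at(edited, stale_index[0])
fresh_read = read_line_at(edited, fresh_index[0])

print("stale index:", len(stale_read.split(b"\t")), "columns:", stale_read)
print("fresh index:", len(fresh_read.split(b"\t")), "columns:", fresh_read)
```

With the stale offsets the read begins inside the record, so fewer tab-separated columns are seen than the parser expects; with the rebuilt offsets the full record is read. This would also be consistent with ValidateVariants traversing the unedited file without complaint.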