GenomicsDBImport terminates after Overlapping contigs found error

QuinnC Member
edited October 15 in Ask the GATK team

My original query was about batching and making intervals for GenomicsDBImport, but I have run into a new problem. I am using version 4.0.7.0, and I tried the following:

gatk GenomicsDBImport \
--java-options "-Xmx250G -XX:+UseParallelGC -XX:ParallelGCThreads=24" \
-V input.list \
--genomicsdb-workspace-path 5sp_45ind_assmb_00 \
--intervals interval.00.list \
--batch-size 9 

where I have split my list of contigs into 50 lists and set the batch size to 9 (instead of reading in all 45 g.vcf files at once), for a total of 5 batches. It started to run but terminated soon after with an error.

The resulting stack trace is:

00:53:23.869 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
00:53:23.869 INFO  GenomicsDBImport - Picard Version: 2.18.7
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:53:23.869 INFO  GenomicsDBImport - Deflater: IntelDeflater
00:53:23.869 INFO  GenomicsDBImport - Inflater: IntelInflater
00:53:23.869 INFO  GenomicsDBImport - GCS max retries/reopens: 20
00:53:23.869 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
00:53:23.869 INFO  GenomicsDBImport - Initializing engine
01:26:13.490 INFO  IntervalArgumentCollection - Processing 58057410 bp from intervals
01:26:13.517 INFO  GenomicsDBImport - Done initializing engine
Created workspace /home/leq/gvcfs/5sp_45ind_assmb_00
01:26:13.655 INFO  GenomicsDBImport - Vid Map JSON file will be written to 5sp_45ind_assmb_00/vidmap.json
01:26:13.655 INFO  GenomicsDBImport - Callset Map JSON file will be written to 5sp_45ind_assmb_00/callset.json
01:26:13.655 INFO  GenomicsDBImport - Complete VCF Header will be written to 5sp_45ind_assmb_00/vcfheader.vcf
01:26:13.655 INFO  GenomicsDBImport - Importing to array - 5sp_45ind_assmb_00/genomicsdb_array
01:26:13.656 INFO  ProgressMeter - Starting traversal
01:26:13.656 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
01:33:16.970 INFO  GenomicsDBImport - Importing batch 1 with 9 samples
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Contig/chromosome ctg7180018354961 begins at TileDB column 0 and intersects with contig/chromosome ctg7180018354960 that spans columns [1380207667, 1380207970]
terminate called after throwing an instance of 'ProtoBufBasedVidMapperException'
  what():  ProtoBufBasedVidMapperException : Overlapping contigs found

How do I overcome this issue of 'overlapping contigs found'? Is there a problem with my set of contigs? Also, is the warning about protocol messages something to worry about?

Thank you!

Answers

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @QuinnC

    Would you please post the contents of your interval list file (interval.00.list)?
    Thank you.

    Regards
    Bhanu

  • QuinnC Member

    Hi @bhanuGandham

    Thanks for looking into this! :)
    Here are the first 10 lines for my interval.00.list:

    ctg7180024710462
    ctg7180024952859
    ctg7180025043953
    ctg7180024953445
    ctg7180024780449
    ctg7180024953454
    ctg7180024361837
    ctg7180024716175
    ctg7180025056092
    ctg7180025038359
    

    but then I remembered that I hadn't specified the intervals for the contigs, so I reran it with just 2 intervals that I know do not overlap, to test things out:

    ctg7180024710462:1697-2196
    ctg7180025043953:2192-2751
    

    However, it terminated with the exact same message as last time:

    GenomicsDBImport - Importing batch 1 with 9 samples
    [libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
    Contig/chromosome ctg7180018354961 begins at TileDB column 0 and intersects with contig/chromosome ctg7180018354960 that spans columns [1380207667, 1380207970]
    terminate called after throwing an instance of 'ProtoBufBasedVidMapperException'
      what():  ProtoBufBasedVidMapperException : Overlapping contigs found
    

    I doubt the interval list file was the problem, but I did map the reads against a draft genome where the contigs definitely overlap, rather than against the targets, in order to recover all relevant reads. A collaborator had successfully obtained SNPs by mapping against a draft genome in GATK 3.8 (using HaplotypeCaller in ERC mode and joint genotyping).

    Thank you!

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @QuinnC

    Our dev team is looking into this and will get back to you soon.

    Regards
    Bhanu

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @QuinnC

    The error message indicates that the problem isn't with the intervals file, but is actually with the vcf files (which is why it's complaining about overlapping contigs that aren't even in the intervals). It might be interesting to try it with only one interval: probably the error will persist.

    Things to try:
    1. I'm worried that the protobuf error messages may be corrupting info in the vcf headers so that the DB is seeing incorrect/inconsistent info. I think maybe adding the option
    --genomicsdb-segment-size 100000000
    might properly size the protocol buffers (although it's hard to be sure because most of the code is 3rd party)

    2. This is a long shot, likely not applicable: I've noticed some tools will occasionally not list all reference contigs in their VCF header lines (e.g. if no variants were found there). If the VCFs were not originally generated by a GATK tool, it might be worth checking your VCFs to ensure that their contig header lines are all identical, because otherwise GenomicsDB might create an inconsistent mapping.

    3. If 1 and 2 don't work: to diagnose this properly (and potentially make a formal "issue" that we can escalate to get Intel to fix) it would be helpful to create a stripped-down set of, say, 2 VCFs that cause this error, if you would be willing to share them.

    Regards
    Bhanu
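
The contig-header check suggested above can be sketched in shell roughly as follows. This is only an illustration, not something posted in the thread: the function names `contig_header_checksums` and `contig_headers_match` are hypothetical, and the sketch assumes uncompressed, plain-text g.vcf files.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of GATK): compare the ##contig header
# lines of several uncompressed gVCF files.

# Print "<checksum>\t<file>" for each gVCF, where the checksum covers only
# that file's ##contig header lines.
contig_header_checksums() {
    local vcf sum
    for vcf in "$@"; do
        # Header lines start with '#'; stop at the first record line so the
        # whole file does not have to be scanned.
        sum=$(awk '/^##contig=/ {print} !/^#/ {exit}' "$vcf" | cksum)
        printf '%s\t%s\n' "$sum" "$vcf"
    done
}

# Exit status 0 only if every given gVCF has identical ##contig lines.
contig_headers_match() {
    local distinct
    distinct=$(contig_header_checksums "$@" | cut -f1 | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}
```

Running `contig_header_checksums *.g.vcf` lists one checksum per file, so a file whose contig lines differ from the rest should stand out; `contig_headers_match *.g.vcf && echo consistent` gives a simple yes/no answer.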

  • QuinnC Member

    Hi @bhanuGandham,

    Adding --genomicsdb-segment-size 100000000 still gives the protocol buffer error. These g.vcf files were generated using gatk4's HaplotypeCaller. I'd be happy to share my g.vcf files, how should I go about sharing them? Will you require the accompanying .idx files?
    Thank you!

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @QuinnC

    You can share your g.vcf files by following the instructions provided here. Yes, please also share your .idx files.

    Regards
    Bhanu

  • QuinnC Member

    Hi @bhanuGandham,

    The file is called twovcfs1.tar.gz. Thank you! :)
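
For anyone preparing a similar minimal test case, one possible way to bundle a few g.vcf files with their .idx files is sketched below. The thread does not say how the archive was actually built; the function name `bundle_gvcfs` and the file names in the usage note are hypothetical, and the sketch assumes each index sits next to its g.vcf as `<file>.idx`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): pack gVCFs and their .idx files
# into one gzip-compressed tar archive.
# Usage: bundle_gvcfs <archive.tar.gz> <file.g.vcf> [<file.g.vcf> ...]
bundle_gvcfs() {
    local archive=$1 vcf
    shift
    # The loop's word list is expanded once, before the first iteration, so
    # appending to the positional parameters inside the loop is safe.
    for vcf in "$@"; do
        if [ ! -f "$vcf.idx" ]; then
            echo "missing index: $vcf.idx" >&2
            return 1
        fi
        set -- "$@" "$vcf.idx"
    done
    # "$@" now holds every gVCF followed by every index file.
    tar -czf "$archive" "$@"
}
```

For example, `bundle_gvcfs minimal_case.tar.gz sampleA.g.vcf sampleB.g.vcf` would fail early if either index is missing, and otherwise produce an archive with four entries.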

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @QuinnC

    Sorry about the delay; our team was away for the Thanksgiving holiday. Our dev team will look into this and get back to you.

    Regards
    Bhanu

  • bhanuGandham Member, Administrator, Broadie, Moderator

    Hi @QuinnC

    Our dev team looked into this, and it appears to us that upgrading to GATK version 4.0.11.0 might resolve this issue. Would you please try that?

    Regards
    Bhanu
