
Trouble getting a combined g.vcf file using either GenomicsDBImport or CombineGVCFs

I am using GenomicsDBImport from GATK 4.0.7.0 to combine my g.vcf files (produced by HaplotypeCaller in -ERC mode) before passing them to GenotypeGVCFs.
GenomicsDBImport uses a lot of memory (about 400GB) for 45 g.vcfs (about 30GB in total), presumably while loading the g.vcf files. It prints this line: "A protocol message was rejected because it was too big (more than 67108864 bytes)", and then runs out of heap space.
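
For context, the downstream step I am aiming for would look roughly like this (a sketch only; the output file name is a placeholder):

# Sketch: gendb:// points GenotypeGVCFs at the GenomicsDB workspace;
# genotyped.vcf.gz is a placeholder output name.
gatk GenotypeGVCFs \
    -R Draft_assembly_ctg.fasta \
    -V gendb://5sp_45ind_assmb \
    -O genotyped.vcf.gz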

Here is my command:

gatk GenomicsDBImport --java-options "-Xmx380G -XX:+UseParallelGC -XX:ParallelGCThreads=24" -V input.list --genomicsdb-workspace-path 5sp_45ind_assmb --intervals intervallist.intervals

And this is the stack trace:

15:51:30.141 INFO  GenomicsDBImport - Initializing engine
16:34:36.890 INFO  IntervalArgumentCollection - Processing 1861814 bp from intervals
16:34:36.900 INFO  GenomicsDBImport - Done initializing engine
Created workspace /home/leq/gvcfs/5sp_45ind_assmb
16:34:37.031 INFO  GenomicsDBImport - Vid Map JSON file will be written to 5sp_45ind_assmb/vidmap.json
16:34:37.032 INFO  GenomicsDBImport - Callset Map JSON file will be written to 5sp_45ind_assmb/callset.json
16:34:37.032 INFO  GenomicsDBImport - Complete VCF Header will be written to 5sp_45ind_assmb/vcfheader.vcf
16:34:37.032 INFO  GenomicsDBImport - Importing to array - 5sp_45ind_assmb/genomicsdb_array
16:34:37.032 INFO  ProgressMeter - Starting traversal
16:34:37.032 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
17:05:14.417 INFO  GenomicsDBImport - Importing batch 1 with 45 samples
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
23:13:58.382 INFO  GenomicsDBImport - Shutting down engine
[August 9, 2018 11:13:58 PM CEST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 442.47 minutes.
Runtime.totalMemory()=362686185472
java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
    at java.lang.StringBuilder.<init>(StringBuilder.java:89)
    at htsjdk.variant.vcf.VCFSimpleHeaderLine.toStringEncoding(VCFSimpleHeaderLine.java:100)
    at htsjdk.variant.vcf.VCFHeaderLine.toString(VCFHeaderLine.java:97)
    at htsjdk.variant.variantcontext.writer.VCFWriter.writeHeader(VCFWriter.java:165)
    at htsjdk.variant.variantcontext.writer.VCFWriter.writeHeader(VCFWriter.java:139)
    at com.intel.genomicsdb.importer.GenomicsDBImporterStreamWrapper.<init>(GenomicsDBImporterStreamWrapper.java:88)
    at com.intel.genomicsdb.importer.GenomicsDBImporter.addBufferStream(GenomicsDBImporter.java:397)
    at com.intel.genomicsdb.importer.GenomicsDBImporter.addSortedVariantContextIterator(GenomicsDBImporter.java:358)
    at com.intel.genomicsdb.importer.GenomicsDBImporter.<init>(GenomicsDBImporter.java:167)
    at com.intel.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:604)
    at com.intel.genomicsdb.importer.GenomicsDBImporter$$Lambda$73/1933493643.get(Unknown Source)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    ... 3 more

I checked my g.vcf files using ValidateVariants in -gvcf mode; traversal completed, reporting the number of variants processed, and there was no error message.
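
For each file the check was along these lines (a sketch; sample01.g.vcf stands in for each of the 45 files):

# Sketch: validate one g.vcf against the reference in GVCF mode.
gatk ValidateVariants \
    -R Draft_assembly_ctg.fasta \
    -V sample01.g.vcf \
    -gvcf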

I have also tried CombineGVCFs, which needs about 320GB just to finish loading all the g.vcf files and then runs out of heap space. Does it really require that much memory for this step?

Here is the command and stack trace for CombineGVCFs.

gatk CombineGVCFs --java-options "-Xmx350G -XX:+UseParallelGC -XX:ParallelGCThreads=40" -R Draft_assembly_ctg.fasta -V input.list -O combined.g.vcf
7:48:43.628 INFO  CombineGVCFs - Shutting down engine
[August 14, 2018 5:48:43 PM CEST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 419.81 minutes.
Runtime.totalMemory()=305573920768
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
    at java.lang.StringBuilder.<init>(StringBuilder.java:89)
    at htsjdk.variant.vcf.VCF4Parser.parseLine(VCFHeaderLineTranslator.java:76)
    at htsjdk.variant.vcf.VCFHeaderLineTranslator.parseLine(VCFHeaderLineTranslator.java:51)
    at htsjdk.variant.vcf.VCFSimpleHeaderLine.<init>(VCFSimpleHeaderLine.java:66)
    at htsjdk.variant.vcf.VCFContigHeaderLine.<init>(VCFContigHeaderLine.java:52)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseHeaderFromLines(AbstractVCFCodec.java:212)
    at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:111)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
    at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
    at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:261)
    at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:102)
    at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:127)
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:113)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:359)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:314)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:261)
    at org.broadinstitute.hellbender.engine.MultiVariantDataSource.lambda$new$0(MultiVariantDataSource.java:89)
    at org.broadinstitute.hellbender.engine.MultiVariantDataSource$$Lambda$68/591247593.accept(Unknown Source)
    at java.util.ArrayList.forEach(ArrayList.java:1249)
    at org.broadinstitute.hellbender.engine.MultiVariantDataSource.<init>(MultiVariantDataSource.java:88)
    at org.broadinstitute.hellbender.engine.MultiVariantWalker.initializeDrivingVariants(MultiVariantWalker.java:76)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:48)
    at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:638)
    at org.broadinstitute.hellbender.engine.MultiVariantWalker.onStartup(MultiVariantWalker.java:44)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

Many thanks!
Quinn

Answers

  • QuinnC Member

    Hi Sheila,

    Thank you for the inspiration! I'll try --batch-size and splitting up the intervals (which is a list of contigs) instead of trying to combine everything in one go. Will update with my progress :)
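
    One way to do the split would be something like this (a sketch; contigs.intervals stands in for my full list of contigs, using GNU split):

    # Sketch: 50 roughly equal files of whole lines, named interval.00.list ... interval.49.list
    split -n l/50 -d --additional-suffix=.list contigs.intervals interval.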

    Cheers,
    Quinn

  • QuinnC Member

    I have tried with the following:

    gatk GenomicsDBImport \
    --java-options "-Xmx250G -XX:+UseParallelGC -XX:ParallelGCThreads=24" \
    -V input.list \
    --genomicsdb-workspace-path 5sp_45ind_assmb_00 \
    --intervals interval.00.list \
    --batch-size 9 
    

    where I have split my list of contigs into 50 lists and set the batch size to 9 (instead of reading in all 45 g.vcfs at once). It looks like it started to run, but it terminated quickly with an error.
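
    The plan is to run one import per chunk, roughly like this (a sketch; only chunk 00 has actually run so far):

    # Sketch: loop over the 50 interval lists, one GenomicsDB workspace per chunk.
    for i in $(seq -w 0 49); do
        gatk GenomicsDBImport \
            --java-options "-Xmx250G -XX:+UseParallelGC -XX:ParallelGCThreads=24" \
            -V input.list \
            --genomicsdb-workspace-path 5sp_45ind_assmb_${i} \
            --intervals interval.${i}.list \
            --batch-size 9
    done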

    The resulting stack trace is:

    00:53:23.869 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
    00:53:23.869 INFO  GenomicsDBImport - Picard Version: 2.18.7
    00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    00:53:23.869 INFO  GenomicsDBImport - Deflater: IntelDeflater
    00:53:23.869 INFO  GenomicsDBImport - Inflater: IntelInflater
    00:53:23.869 INFO  GenomicsDBImport - GCS max retries/reopens: 20
    00:53:23.869 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    00:53:23.869 INFO  GenomicsDBImport - Initializing engine
    01:26:13.490 INFO  IntervalArgumentCollection - Processing 58057410 bp from intervals
    01:26:13.517 INFO  GenomicsDBImport - Done initializing engine
    Created workspace /home/leq/gvcfs/5sp_45ind_assmb_00
    01:26:13.655 INFO  GenomicsDBImport - Vid Map JSON file will be written to 5sp_45ind_assmb_00/vidmap.json
    01:26:13.655 INFO  GenomicsDBImport - Callset Map JSON file will be written to 5sp_45ind_assmb_00/callset.json
    01:26:13.655 INFO  GenomicsDBImport - Complete VCF Header will be written to 5sp_45ind_assmb_00/vcfheader.vcf
    01:26:13.655 INFO  GenomicsDBImport - Importing to array - 5sp_45ind_assmb_00/genomicsdb_array
    01:26:13.656 INFO  ProgressMeter - Starting traversal
    01:26:13.656 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
    01:33:16.970 INFO  GenomicsDBImport - Importing batch 1 with 9 samples
    [libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
    Contig/chromosome ctg7180018354961 begins at TileDB column 0 and intersects with contig/chromosome ctg7180018354960 that spans columns [1380207667, 1380207970]
    terminate called after throwing an instance of 'ProtoBufBasedVidMapperException'
      what():  ProtoBufBasedVidMapperException : Overlapping contigs found
    

    How do I overcome this issue of 'overlapping contigs found'?
