Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GenomicsDBImport not completing for mixed ploidy samples

livinlrglivinlrg University of WashingtonMember
I'm attempting to call variants on whole genomes for about 500 illumina paired-end samples with varying ploidy (haploid to tetraploid). I'm running a fairly standard uBam to GVCF pipeline with HaplotypeCaller passed the ploidy information (1,2,3, or 4) in -ERC GVCF mode. I then try to collect the GVCFs using GenomicsDBImport in a batch size of 50 and use GenotypeGVCFs on the combined database. My interval list that is passed to GenomicsDBImport is just each chromosome on a separate line. I'm using GATK v4.1.1.0

Command:
```
${GATK_DIR}/gatk GenomicsDBImport \
--java-options "-Xmx110g -Xms110g" \
-R ${REF} \
--variant ${FILE_LIST} \
-L ${SCRIPT_DIR}/GATK_Style_Interval.list \
--genomicsdb-workspace-path ${WORK_DIR}/GenomicsDB_20190912 \
--batch-size 50 \
--tmp-dir=${WORK_DIR}/
```

GenomicsDBImport appears to run without error, but only shows progress for the first 6000 bp before moving onto the next batch. When I run select variants on the created database, I only get variants up to position 6716 in the first interval. When I try to run GenotypeGVCF on it, I get a strange error:
htsjdk.tribble.TribbleException: Invalid block size -1570639203

My first assumption is that one of the gvcf's is malformed from HaplotypeCaller failing after the first 6000 bp, but I've verified that the gvcfs have all completed and have 'validated' them with ValidateVariants using GATK v4.1.3.0. When I grep for the particular position in the sample's gvcfs I don't find anything out of the ordinary. I would use CombineGVCFs, but it fails due to trying to combine mixed ploidies.

Any ideas on troubleshooting or experience with problems like this?

Answers

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Hi @livinlrg, can you paste the entire error log, please?

  • livinlrglivinlrg University of WashingtonMember
    Hello @Tiffany_at_Broad ,

    Thank you for responding. I'm quite desperate at this point to resolve this issue. Below is the full entry generated by GenotypeGVCFs. Prior to this GenomicsDBImport ran seemingly normally, but only producing one entry in the progress meter, which I suspect means that it somehow stopped in the first interval. Please let me know if you want me to post the output from GenomicsDBImport

    GenotypeGVCFs Output:

    '''
    Using GATK jar /net/gs/vol3/software/modules-sw/GATK/4.1.1.0/Linux/RHEL6/x86_64/gatk-package-4.1.1.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx12g -Xms12g -jar /net/gs/vol3/software/modules-sw/GATK/4.1.1.0/Linux/RHEL6/x86_64/gatk-package-4.1.1.0-local.jar GenotypeGVCFs -R /ReferenceGenome/Reference_2Micron/sacCer3_2micron.fasta -V gendb://GenomicsDB_20190910 -O /WorkDirectory_Combined/FinishedToDate_20190909.vcf
    13:25:33.664 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/net/gs/vol3/software/modules-sw/GATK/4.1.1.0/Linux/RHEL6/x86_64/gatk-package-4.1.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Sep 11, 2019 1:25:36 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    13:25:36.623 INFO GenotypeGVCFs - ------------------------------------------------------------
    13:25:36.623 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.1.0
    13:25:36.624 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    13:25:36.624 INFO GenotypeGVCFs - Executing as livinlrg on Linux v2.6.32-754.14.2.el6.x86_64 amd64
    13:25:36.624 INFO GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_25-b17
    13:25:36.624 INFO GenotypeGVCFs - Start Date/Time: September 11, 2019 1:25:33 PM PDT
    13:25:36.624 INFO GenotypeGVCFs - ------------------------------------------------------------
    13:25:36.624 INFO GenotypeGVCFs - ------------------------------------------------------------
    13:25:36.625 INFO GenotypeGVCFs - HTSJDK Version: 2.19.0
    13:25:36.625 INFO GenotypeGVCFs - Picard Version: 2.19.0
    13:25:36.625 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    13:25:36.625 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    13:25:36.625 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    13:25:36.625 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    13:25:36.625 INFO GenotypeGVCFs - Deflater: IntelDeflater
    13:25:36.625 INFO GenotypeGVCFs - Inflater: IntelInflater
    13:25:36.626 INFO GenotypeGVCFs - GCS max retries/reopens: 20
    13:25:36.626 INFO GenotypeGVCFs - Requester pays: disabled
    13:25:36.626 INFO GenotypeGVCFs - Initializing engine
    WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
    13:25:38.438 INFO GenotypeGVCFs - Done initializing engine
    13:25:38.527 INFO ProgressMeter - Starting traversal
    13:25:38.527 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
    13:28:15.290 INFO ProgressMeter - chrI:1000 2.6 1000 382.7
    13:28:32.209 INFO ProgressMeter - chrI:2000 2.9 2000 690.9
    13:28:45.148 INFO ProgressMeter - chrI:3000 3.1 3000 964.5
    13:28:57.408 INFO ProgressMeter - chrI:5001 3.3 5000 1508.4
    13:29:29.163 INFO GenotypeGVCFs - Shutting down engine
    GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),28.43237578199998,Cpu time(s),27.069680313999996
    [September 11, 2019 1:29:29 PM PDT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 3.94 minutes.
    Runtime.totalMemory()=12863930368
    htsjdk.tribble.TribbleException: Invalid block size -1497008715
    at htsjdk.variant.bcf2.BCF2Decoder.readNextBlock(BCF2Decoder.java:66)
    at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:134)
    at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:58)
    at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:156)
    at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:45)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:423)
    at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:129)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:984)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
    '''
  • bshifawbshifaw Member, Broadie, Moderator admin

    Do you receive the same error using GATK4.1.3.0? If so can you pass along a sample gvcf that would produce the same error. BugReport

  • livinlrglivinlrg University of WashingtonMember
    I do receive the error when using GATK4.1.3.0 to run GenomicsDBImport to GenotypeGVCFs on all of my samples. When I run a reduced list of 11 samples (total dataset is 449 samples), there are no errors produced. So I'm afraid I can't give you sample GVCF's that will recreate the error without sending over a ton of files. I have produced a BugReport under the name livinlrg_Errors.zip. It includes my reference genome and associated dict files, interval list, pipeline to produce the gvcfs, and the error file.

    In the process of troubleshooting I have discovered a number of related issues:
    1) I have narrowed down the region in which the program fails to within 300 bp on chrI. When I avoid the problem interval, chrI, I still get an error in the next interval, chrII (livinlrg_IntervalTest_Error.txt). It is, however, an OutOfBounds error.
    2) When I split the aggregated dataset into databases based on the study they are from GenotypeGVCFs is able to transverse the problem interval.

    If you think it'll be helpful, I can grep the 300bp problem interval and header from the 449 sample g.vcfs. Please let me know what other info I can provide to try to diagnose this problem.
  • bshifawbshifaw Member, Broadie, Moderator admin

    Hi @livinlrg,

    I spoke with my team and they suspect it may be related to memory usage. You maybe specifying to much memory in your command line, try reducing your memory down to 100g Xmx and Xms.

    If the memory change doesn't work then please grep the problem intervals as you mentioned earlier.

    Would you also please describe your server such as memory, disk, filesystem, etc..

Sign In or Register to comment.