CombineGVCFs quits 25% of the way through genome

kaltendo (Minnesota, USA), Member
When I run CombineGVCFs, I get a Java error and the run stops on chromosome 6 of 21. I had already combined each flowcell and lane combination into its own cohort, so this step combines all of those cohorts for GenotypeGVCFs.

I am following best practices, except that I am using GATK 4.1.2 because that is the version available on my supercomputing platform. I know I should probably use GenomicsDB, but I tried it and it was terribly slow, and I've already invested a lot of time in getting to this stage with CombineGVCFs.

I have tried checking the cohort files and they all appear to have data for 21 chromosomes.

Thanks for any insight you can provide.


````
# command
gatk CombineGVCFs --java-options "-Xmx64g" -R $REFERENCE \
--variant CBEEGANXX_1_cohort.g.vcf \
--variant CBEEGANXX_2_cohort.g.vcf \
--variant CBEEGANXX_3_cohort.g.vcf \
--variant CBEEGANXX_4_cohort.g.vcf \
--variant CBEEGANXX_5_cohort.g.vcf \
--variant CBEEGANXX_6_cohort.g.vcf \
--variant CBEEGANXX_7_cohort.g.vcf \
--variant CBEEGANXX_8_cohort.g.vcf \
--variant CBFULANXX_6_cohort.g.vcf \
--variant CBFULANXX_7_cohort.g.vcf \
--variant CBFULANXX_8_cohort.g.vcf \
--variant CC680ANXX_1_cohort.g.vcf \
--variant CC680ANXX_2_cohort.g.vcf \
--variant CC680ANXX_3_cohort.g.vcf \
--variant CC680ANXX_4_cohort.g.vcf \
-O /scratch.global/kaltendo/gatk_temp/NAM_GATK/GenotypeGVCF/cohort2.g.vcf.gz
````
````
# error message
#01:41:05.306 INFO ProgressMeter - Chr06:536254315 311.5 27769000 89140.7
#01:41:32.376 INFO CombineGVCFs - Shutting down engine
#[December 16, 2019 1:41:32 AM CST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 314.54 minutes.
#Runtime.totalMemory()=2542796800
#java.lang.ArrayIndexOutOfBoundsException: 32770
# at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:147)
# at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
# at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
````

Answers

  • bhanuGandham (Cambridge MA), Member, Administrator, Broadie, Moderator

    Hi @kaltendo

    1. Can you try validating your input GVCFs with ValidateVariants? You can use the --validate-GVCF argument.
    2. How many samples are you running CombineGVCFs on?
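
A minimal sketch of that check across several cohort files, assuming the file names from the original post. `print_validate_cmds` is a hypothetical helper that only prints the commands (a dry run); the `-R`, `-V`, and `--validate-GVCF` arguments are the ones that appear elsewhere in this thread.

```shell
# Dry-run helper: prints one ValidateVariants command per GVCF.
# Assumes file names without spaces, as in the original post.
print_validate_cmds() {
    ref="$1"; shift
    for gvcf in "$@"; do
        printf 'gatk ValidateVariants -R %s -V %s --validate-GVCF\n' "$ref" "$gvcf"
    done
}

# Example: show the commands for two of the cohort files.
print_validate_cmds '$REFERENCE' CBEEGANXX_1_cohort.g.vcf CBEEGANXX_2_cohort.g.vcf
```

Piping the printed lines to `sh` would run them one after another; reviewing them first avoids launching long jobs with the wrong reference.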
  • kaltendo (Minnesota, USA), Member
    Hi @bhanuGandham

    I ran this on one of my files as a test and I got this error:

    A USER ERROR has occurred: In a GVCF all records must ordered. Record: [VC Unknown @ Chr02:1-2214 Q. of type=SYMBOLIC alleles=[G*, <NON_REF>] attr={END=2214} filters= covers a position previously traversed.

    Could this be related? This error is on Chr 02, but CombineGVCFs seems to have made it through this region just fine before when it made it to Chr 06.

    I have a total of 1350 samples. I first combined them into cohorts of about 90 each; this question relates to my attempt to combine those final cohorts.
  • bhanuGandham (Cambridge MA), Member, Administrator, Broadie, Moderator

    Hi @kaltendo

    This issue should be resolved in the most recent version of ValidateVariants. Take a look at this issue ticket: https://github.com/broadinstitute/gatk/issues/6023

    Can you please try again with ValidateVariants from the latest GATK version?

  • kaltendo (Minnesota, USA), Member
    edited December 2019
    Hi @bhanuGandham

    Thank you. I ran my cohort files through the latest version of ValidateVariants using the option you suggested. They appear fine with a few exceptions:

    1) They ran at highly variable rates: some took 2 minutes, others 2 hours. I'm not sure whether this is a problem or a red flag.

    2) Since the genome I'm using has 2000+ unanchored scaffolds that I don't need, I specified intervals (-L) for the chromosomes in HaplotypeCaller. Thus, it appears that the header, which still lists the scaffolds, is triggering an error (see below) because the GVCFs have no data for them.

    Apart from these two things, I get no other warnings. I also did not have any problems using CombineGVCFs within a flowcell-lane combination; the problem only appears now that I'm combining across them. I also tried excluding (-XL) Chr 6, which caused the initial CombineGVCFs error, but the same failure then occurred on Chr 15.

    Here's an example error message from ValidateVariants:

    10:36:06.259 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:
    file:
    4.1-0/gatk-package-4.1.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Dec 18, 2019 10:36:12 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    10:36:12.266 INFO ValidateVariants - ------------------------------------------------------------
    10:36:12.275 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.4.1
    10:36:12.275 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    10:36:12.276 INFO ValidateVariants - Executing as [email protected] on Linux v3.10.0-1062.4.3.el7.x86_64 amd64
    10:36:12.276 INFO ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
    10:36:12.276 INFO ValidateVariants - Start Date/Time: December 18, 2019 10:36:06 AM CST
    10:36:12.276 INFO ValidateVariants - ------------------------------------------------------------
    10:36:12.276 INFO ValidateVariants - ------------------------------------------------------------
    10:36:12.277 INFO ValidateVariants - HTSJDK Version: 2.21.0
    10:36:12.278 INFO ValidateVariants - Picard Version: 2.21.2
    10:36:12.278 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    10:36:12.278 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    10:36:12.278 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    10:36:12.278 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    10:36:12.278 INFO ValidateVariants - Deflater: IntelDeflater
    10:36:12.278 INFO ValidateVariants - Inflater: IntelInflater
    10:36:12.278 INFO ValidateVariants - GCS max retries/reopens: 20
    10:36:12.278 INFO ValidateVariants - Requester pays: disabled
    10:36:12.279 INFO ValidateVariants - Initializing engine
    10:36:14.720 INFO FeatureManager - Using codec VCFCodec to read file file:
    10:36:15.337 INFO ValidateVariants - Done initializing engine
    10:36:15.347 WARN ValidateVariants - GVCF format is currently incompatible with allele validation. Not validating Alleles.
    10:36:15.347 WARN ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
    10:36:15.347 WARN ValidateVariants - Other possible validations will still be performed
    10:36:15.347 INFO ProgressMeter - Starting traversal
    10:36:15.347 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    10:36:25.762 INFO ProgressMeter - Chr01:115796019 0.2 82000 472667.9
    10:36:35.764 INFO ProgressMeter - Chr01:243694790 0.3 168000 493706.2
    10:36:45.880 INFO ProgressMeter - Chr01:369541438 0.5 263000 517139.7
    10:36:55.888 INFO ProgressMeter - Chr01:448039979 0.7 351000 519486.9
    10:37:11.241 INFO ProgressMeter - Chr02:2186511 0.9 411000 441271.2
    10:37:21.254 INFO ProgressMeter - Chr02:146246437 1.1 547000 498004.6
    10:37:31.359 INFO ProgressMeter - Chr02:243465279 1.3 629000 496507.1
    10:37:41.362 INFO ProgressMeter - Chr03:271469402 1.4 1030000 718487.7
    10:37:51.370 INFO ProgressMeter - Chr05:319939387 1.6 1674000 1045999.4
    10:38:03.202 INFO ProgressMeter - Chr07:350319 1.8 2255000 1254473.6
    10:38:13.201 INFO ProgressMeter - Chr08:411241447 2.0 3019000 1536986.4
    10:38:23.213 INFO ProgressMeter - Chr10:468095382 2.1 3714000 1742762.0
    10:38:33.217 INFO ProgressMeter - Chr13:153387370 2.3 4451000 1937056.2
    10:38:43.228 INFO ProgressMeter - Chr15:247329112 2.5 5280000 2142263.0
    10:38:53.236 INFO ProgressMeter - Chr17:274706426 2.6 6082000 2311258.6
    10:39:03.240 INFO ProgressMeter - Chr19:346787344 2.8 6826000 2439410.8
    10:39:13.244 INFO ProgressMeter - Chr20:561519072 3.0 7626000 2572064.6
    10:39:19.146 INFO ProgressMeter - Chr21:448338031 3.1 8182519 2671130.6
    10:39:19.146 INFO ProgressMeter - Traversal complete. Processed 8182519 total variants in 3.1 minutes.
    10:39:19.165 INFO ValidateVariants - Shutting down engine
    [December 18, 2019 10:39:19 AM CST] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 3.22 minutes.
    Runtime.totalMemory()=34311503872
    ***********************************************************************

    A USER ERROR has occurred: A GVCF must cover the entire region. Found 1800785837 loci with no VariantContext covering it. The first uncovered segment is:ChrUN:1-849886714

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
    Using GATK jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms32G -Xmx32G -jar ./.conda/envs/gatk_env/share/gatk4-4.1.4.1-0/gatk-package-4.1.4.1-local.jar ValidateVariants -R ### -V CC680ANXX_3_cohort.g.vcf --validate-GVCF
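
The USER ERROR above concerns ChrUN, which has no records because HaplotypeCaller was restricted with -L. One possible way to keep the check focused on the 21 chromosomes is to pass the same intervals to ValidateVariants. This is a sketch, not a confirmed fix: the `Chr01` to `Chr21` names are taken from the log, `chrom_interval_args` is a hypothetical helper, and how a given GATK version combines `-L` with `--validate-GVCF` should be checked against that version's documentation. The snippet only prints the command.

```shell
# Build "-L Chr01 -L Chr02 ... -L ChrNN" (zero-padded names, as in the log).
chrom_interval_args() {
    n="$1"
    i=1
    args=""
    while [ "$i" -le "$n" ]; do
        args="$args -L $(printf 'Chr%02d' "$i")"
        i=$((i + 1))
    done
    # Drop the leading space before printing.
    printf '%s\n' "${args# }"
}

# Dry run: print the command instead of executing it.
printf 'gatk ValidateVariants -R %s -V %s --validate-GVCF %s\n' \
    '$REFERENCE' CC680ANXX_3_cohort.g.vcf "$(chrom_interval_args 21)"
```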
  • kaltendo (Minnesota, USA), Member
    @bhanuGandham

    I should also mention that I tried combining 2 or 4 of the cohorts at a time instead of all 15. This also failed at Chr 06 across all combinations I tried.

    Thank you for any insight you may have.
  • kaltendo (Minnesota, USA), Member
    Changing the file extension to .vcf as opposed to .vcf.gz appears to have solved this issue for me. Thank you so much!
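
For readers applying the same workaround, a small sketch of deriving the uncompressed output name from the compressed one. `uncompressed_name` is a hypothetical helper, the file name comes from the first post, and the snippet only prints the adjusted `-O` argument rather than running GATK.

```shell
# Strip a trailing ".gz" so the output path names a plain-text VCF.
# Names without a ".gz" suffix are returned unchanged.
uncompressed_name() {
    printf '%s\n' "${1%.gz}"
}

OUT=$(uncompressed_name cohort2.g.vcf.gz)
# Dry run: print the output argument that would replace the original one.
printf '%s %s\n' -O "$OUT"
```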
  • emeryj, Member, Broadie

    @kaltendo I am glad that worked! I am going to open a GATK ticket to investigate why you were hitting that error. It would be easier for us to diagnose the issue if we had a minimal input that reproduces it and could see the rest of the stack trace.

  • kaltendo (Minnesota, USA), Member
    Here is an example of the command and stacktrace. I am happy to provide the files, just let me know how you'd like me to get them to you.


    # command
    GATK_SETTINGS='-DF NotDuplicateReadFilter -DF MappingQualityAvailableReadFilter'

    gatk --java-options "-Xmx50g" CombineGVCFs -R $REFERENCE \
    --variant CBEEGANXX_1_cohort.g.vcf \
    --variant CBEEGANXX_2_cohort.g.vcf ${GATK_SETTINGS} \
    -L Chr06 \
    -O /scratch.global/####/gatk_temp/NAM_GATK/GenotypeGVCF/test_cohorts/combine_cohorts_test.g.vcf.gz


    # stacktrace
    Using GATK jar /panfs/roc/msisoft/gatk/4.1.2/gatk-package-4.1.2.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx50g -jar /panfs/roc/msisoft/gatk/4.1.2/gatk-package-4.1.2.0-local.jar CombineGVCFs -R /home/####/shared/IWG_v1_genome/annotated_v2_release/index/Thinopyrum_intermedium.mainGenome.fasta --variant CBEEGANXX_1_cohort.g.vcf --variant CBEEGANXX_2_cohort.g.vcf -DF NotDuplicateReadFilter -DF MappingQualityAvailableReadFilter -L Chr06 -O /scratch.global/####/gatk_temp/NAM_GATK/GenotypeGVCF/test_cohorts/combine_cohorts_test.g.vcf.gz
    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/scratch.local/####/gatk_temp
    10:16:52.811 WARN GATKReadFilterPluginDescriptor - Disabled filter (NotDuplicateReadFilter) is not enabled by this tool
    10:16:52.813 WARN GATKReadFilterPluginDescriptor - Disabled filter (MappingQualityAvailableReadFilter) is not enabled by this tool
    10:16:53.056 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/panfs/roc/msisoft/gatk/4.1.2/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Dec 20, 2019 10:16:55 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    10:16:55.074 INFO CombineGVCFs - ------------------------------------------------------------
    10:16:55.074 INFO CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.1.2.0
    10:16:55.074 INFO CombineGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    10:16:55.076 INFO CombineGVCFs - Executing as [email protected] on Linux v3.10.0-1062.4.3.el7.x86_64 amd64
    10:16:55.076 INFO CombineGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_202-b08
    10:16:55.076 INFO CombineGVCFs - Start Date/Time: December 20, 2019 10:16:52 AM CST
    10:16:55.076 INFO CombineGVCFs - ------------------------------------------------------------
    10:16:55.076 INFO CombineGVCFs - ------------------------------------------------------------
    10:16:55.077 INFO CombineGVCFs - HTSJDK Version: 2.19.0
    10:16:55.077 INFO CombineGVCFs - Picard Version: 2.19.0
    10:16:55.077 INFO CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    10:16:55.077 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    10:16:55.077 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    10:16:55.077 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    10:16:55.077 INFO CombineGVCFs - Deflater: IntelDeflater
    10:16:55.077 INFO CombineGVCFs - Inflater: IntelInflater
    10:16:55.077 INFO CombineGVCFs - GCS max retries/reopens: 20
    10:16:55.077 INFO CombineGVCFs - Requester pays: disabled
    10:16:55.077 INFO CombineGVCFs - Initializing engine
    10:16:55.742 INFO FeatureManager - Using codec VCFCodec to read file file:///scratch.global/####/gatk_temp/NAM_GATK/HaplotypeCaller/NAM_GATK/GenotypeGVCF_attempt2/NAM_GATK/Raw/Combined_GVCFs/CBEEGANXX_1_cohort.g.vcf
    10:16:56.007 INFO FeatureManager - Using codec VCFCodec to read file file:///scratch.global/####/gatk_temp/NAM_GATK/HaplotypeCaller/NAM_GATK/GenotypeGVCF_attempt2/NAM_GATK/Raw/Combined_GVCFs/CBEEGANXX_2_cohort.g.vcf
    10:17:08.102 INFO IntervalArgumentCollection - Processing 570865161 bp from intervals
    10:17:08.127 INFO CombineGVCFs - Done initializing engine
    10:17:08.187 INFO ProgressMeter - Starting traversal
    10:17:08.187 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    10:17:18.485 INFO ProgressMeter - Chr06:16168713 0.2 29000 168964.8
    10:17:28.657 INFO ProgressMeter - Chr06:36474563 0.3 69000 202247.2
    10:17:38.780 INFO ProgressMeter - Chr06:63095194 0.5 112000 219658.1
    10:17:48.906 INFO ProgressMeter - Chr06:87544008 0.7 155000 228400.2
    10:17:58.917 INFO ProgressMeter - Chr06:118639191 0.8 198000 234181.0
    10:18:09.119 INFO ProgressMeter - Chr06:161740380 1.0 242000 238298.4
    10:18:19.250 INFO ProgressMeter - Chr06:200781970 1.2 286000 241475.9
    10:18:29.483 INFO ProgressMeter - Chr06:240418599 1.4 330000 243554.4
    10:18:39.527 INFO ProgressMeter - Chr06:281930395 1.5 373000 245021.3
    10:18:49.604 INFO ProgressMeter - Chr06:318800412 1.7 416000 246112.6
    10:18:59.614 INFO ProgressMeter - Chr06:350679743 1.9 459000 247157.3
    10:19:09.812 INFO ProgressMeter - Chr06:378387256 2.0 503000 248139.8
    10:19:19.828 INFO ProgressMeter - Chr06:413945891 2.2 546000 248858.6
    10:19:29.922 INFO ProgressMeter - Chr06:450192242 2.4 589000 249338.6
    10:19:40.129 INFO ProgressMeter - Chr06:483585145 2.5 632000 249568.9
    10:19:50.177 INFO ProgressMeter - Chr06:503806075 2.7 675000 250017.0
    10:20:00.232 INFO ProgressMeter - Chr06:523496619 2.9 718000 250399.6
    10:20:06.897 INFO CombineGVCFs - Shutting down engine
    [December 20, 2019 10:20:06 AM CST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 3.23 minutes.
    Runtime.totalMemory()=555745280
    java.lang.ArrayIndexOutOfBoundsException: 32770
    at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:147)
    at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
    at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
    at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
    at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:231)
    at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.closeTool(CombineGVCFs.java:495)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1043)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
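
One possible reading of this trace, not confirmed anywhere in the thread: the exception is raised while the tabix index for the .vcf.gz output is being finalized, and the tabix format only addresses positions up to 2^29 (536,870,912) bp per contig, using a linear index of 16 kb windows. Window 32770 × 16,384 bp is 536,903,680, just past that limit, and the log above reports 570,865,161 bp for Chr06. Under that assumption, the sketch below lists the contigs in a samtools-style .fai file that exceed the limit. `contigs_over_tbi_limit` and `TBI_MAX` are hypothetical names introduced here.

```shell
# Tabix (.tbi) indexes address at most 2^29 bp per contig.
TBI_MAX=536870912

# Print "name<TAB>length" for every contig longer than TBI_MAX.
# Input: a .fai index (column 1 = contig name, column 2 = length in bp).
contigs_over_tbi_limit() {
    awk -v max="$TBI_MAX" '$2 > max { print $1 "\t" $2 }' "$1"
}

# Usage (path is a placeholder for the reference index):
#   contigs_over_tbi_limit "$REFERENCE.fai"
```

If the reference lists any contig here, that would be consistent with the run succeeding once the output was written as plain .vcf, which presumably avoids building a tabix index.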