ArrayIndexOutOfBoundsException error in BaseRecalibratorSpark

joe297 · America · Member
edited May 2017 in GATK 4 Beta

Thank you for your time.
I ran BaseRecalibratorSpark with GATK4 and GATK4-protected on an Amazon instance. Both gave me the error java.lang.ArrayIndexOutOfBoundsException: 1073741865. When running a small dataset I didn't get this error, but when I used the real dataset, it appeared.

The command I used is gatk4 BaseRecalibratorSpark -I xx_markduplicatespark.bam -knownSites /genome/ref/dbsnp_138.b37.vcf -knownSites /genome/ref/Mills_and_1000G_gold_standard.indels.b37.vcf -O xx_baserecalibratespark.table -R /curr/tianj/data/humann_g1k_v37.2bit --TMP_DIR tmp

And here is the error message,

Using GATK wrapper script /curr/tianj/software/gatk/build/install/gatk/bin/gatk
/curr/tianj/software/gatk/build/install/gatk/bin/gatk BaseRecalibratorSpark -I A15_markduplicatespark.bam -knownSites ref/Mills_and_1000G_gold_standard.indels.b37.vcf -O A15_baserecalibratespark.table -R /curr/tianj/data/humann_g1k_v37.2bit
17:19:00.338 INFO NativeLibraryLoader - Loading from jar:file:/curr/tianj/software/gatk/build/
[May 17, 2017 5:19:00 PM UTC] --knownSites /genome/ref/db_1000G_gold_standard.indels.b37.vcf --output A15_baserecalibratespark.table --reference /curr/tianj/data/humann_g1k_v37.2bp --joinStrategy BROADCAST --mismatches_context_size 2 --indels_context_size 3 --maximum_cycle_value 500 --mismatches_defdeletions_default_quality 45 --low_quality_tail 2 --quantizing_levels 16 --bqsrBAQGapOpenPenalty 40.0 --preserve_qscores_lles false --useOriginalQualities false --defaultBaseQualities -1 --readShardSize 10000 --readShardPadding 1000 --readValid-interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --disableSequenceDictionaryValidation false --sharl[*] --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inf
[May 17, 2017 5:19:00 PM UTC] Executing as [email protected] on Linux 4.4.41-36.55.amzn1.x86_64 amd64; Java HotSpot(TM:4.alpha.2-261-gb8d32ee-SNAPSHOT
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.BUFFER_SIZE : 131072
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.COMPRESSION_LEVEL : 1
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CREATE_INDEX : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CREATE_MD5 : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CUSTOM_READER_FACTORY :
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK :
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.REFERENCE_FASTA : null
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
17:19:00.371 INFO BaseRecalibratorSpark - Deflater IntelDeflater
17:19:00.372 INFO BaseRecalibratorSpark - Inflater IntelInflater
17:19:00.372 INFO BaseRecalibratorSpark - Initializing engine
17:19:00.372 INFO BaseRecalibratorSpark - Done initializing engine
17:19:00.872 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java clas
17:22:09.153 INFO BaseRecalibratorSpark - Shutting down engine
[May 17, 2017 5:22:09 PM UTC] done. Elapsed time: 3.15 min
java.lang.ArrayIndexOutOfBoundsException: 1073741865
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(
at com.esotericsoftware.kryo.util.MapReferenceResolver.reset(
at com.esotericsoftware.kryo.Kryo.reset(
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:195)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1310)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
at org.broadinstitute.hellbender.engine.spark.BroadcastJoinReadsWithVariants.join(BroadcastJoinReadsWithVariants.j
at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark.add(
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(
at org.broadinstitute.hellbender.Main.runCommandLineProgram(
at org.broadinstitute.hellbender.Main.mainEntry(
at org.broadinstitute.hellbender.Main.main(

Sorry, but I was wrong before: this error also appears with the small dataset.



  • joe297 · America · Member

    I can run it after I delete dbsnp_138.b37.vcf. But can anyone tell me why this is happening?

  • Sheila · Broad Institute · Member, Broadie, Moderator


    I just moved your question to the GATK4 category where someone will help you. Keep in mind these tools are still experimental, so we cannot provide full support yet.


  • LouisB · Broad Institute · Member, Broadie, Dev

    Hi @joe297.

    I haven't seen this exact issue before, but I suspect it is related to a known issue with broadcasting large files using Spark. If it is a relative of that error, then the current best workaround is probably to upgrade your cluster to Kryo 4.0+. That may be infeasible for you, though. The awkward workaround is to change the JVM's hashCode behavior by providing the additional Spark configuration options --conf spark.executor.extraJavaOptions=-XX:hashCode=0
    and --conf spark.driver.extraJavaOptions=-XX:hashCode=0.
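    Applied to the command from this thread, the workaround might look like the sketch below. This is a hedged example, not a verified invocation: the `--` separator and the `--sparkRunner`/`--sparkMaster` flags are assumptions about the GATK4-beta launcher conventions and may differ in your build (check `gatk BaseRecalibratorSpark --help`); the file paths are the ones from this thread.

```shell
# Sketch: BaseRecalibratorSpark with the JVM hashCode workaround applied
# to both the Spark driver and the executors.
# Assumed: Spark-level arguments pass through after a "--" separator, and
# the cluster master URL is a placeholder.
gatk BaseRecalibratorSpark \
    -I A15_markduplicatespark.bam \
    -knownSites ref/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -O A15_baserecalibratespark.table \
    -R /curr/tianj/data/humann_g1k_v37.2bit \
    -- \
    --sparkRunner SPARK \
    --sparkMaster spark://your-master:7077 \
    --conf "spark.executor.extraJavaOptions=-XX:hashCode=0" \
    --conf "spark.driver.extraJavaOptions=-XX:hashCode=0"
```

    Quoting each key=value pair keeps the shell from splitting the embedded `=` and the `-XX` option.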

    Would you mind filing this as an issue on the gatk4 GitHub tracker?


  • joe297 · America · Member

    Hi @LouisB

    Sorry, it's not because of the size of the data; it also happens when I use small files. I used both the dbsnp_138 and Mills gold standard databases in the original command when I got these errors. However, after I removed the dbsnp_138 database, the command works.

  • cuimie · China · Member

    Just share my experience:

    I encountered similar errors while trying to run BQSRPipelineSpark or ReadsPipelineSpark with --knownSites dbsnp147 (a 3.3 GB vcf.gz file). Then I tried 1000G_phase1 (a 1.8 GB vcf.gz file); it failed too. I also tried @LouisB's suggestion of setting --conf "spark.executor.extraJavaOptions=-XX:hashCode=0" --conf "spark.driver.extraJavaOptions=-XX:hashCode=0", but it ran for 5 hours until I killed it manually; I think that was unreasonably long for ~50X human exome data.

    Finally, I used hapmap_3.3.vcf (a 59 MB vcf.gz file), and it worked. Mills_and_1000G_gold_standard.indels (a 20 MB vcf.gz file) worked too.
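    Since smaller known-sites files succeed where the large ones fail, one diagnostic workaround is to shrink the broadcast by subsetting the big VCF, for example to a single contig. This is a sketch that is not from the thread: it assumes bcftools, bgzip, and tabix are installed, and the file names are placeholders. Note that recalibrating against a subset of known sites changes the BQSR covariate tables, so treat this as a diagnostic rather than a production fix.

```shell
# Sketch: shrink a large known-sites VCF so the Spark broadcast stays small.
# Requires htslib (bgzip, tabix) and bcftools; file names are placeholders.

# Compress and index the original VCF so bcftools can do region queries:
bgzip -c dbsnp_138.b37.vcf > dbsnp_138.b37.vcf.gz
tabix -p vcf dbsnp_138.b37.vcf.gz

# Extract one contig (here chromosome 20, b37 naming) and index the subset:
bcftools view -r 20 dbsnp_138.b37.vcf.gz -Oz -o dbsnp_138.b37.chr20.vcf.gz
tabix -p vcf dbsnp_138.b37.chr20.vcf.gz
```

    The resulting chr20-only file can then be passed to -knownSites for a run restricted to that contig.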

    GATK version: 4.alpha.2-1125-g27b5190-SNAPSHOT, built with Kryo 3 (I found the jar file build/install/gatk/lib/kryo-3.0.3.jar).
