
ArrayIndexOutOfBoundsException error in BaseRecalibratorSpark

joe297 (America, Member)
edited May 2017 in GATK 4 Beta

Hi,
Thank you for your time.
I ran BaseRecalibratorSpark with GATK4 and GATK4-protected on an Amazon instance. Both gave me the error java.lang.ArrayIndexOutOfBoundsException: 1073741865. I didn't get this error when running a small dataset, but it appears when I use the real dataset.

The command I used is: gatk4 BaseRecalibratorSpark -I xx_markduplicatespark.bam -knownSites /genome/ref/dbsnp_138.b37.vcf -knownSites /genome/ref/Mills_and_1000G_gold_standard.indels.b37.vcf -O xx_baserecalibratespark.table -R /curr/tianj/data/humann_g1k_v37.2bit --TMP_DIR tmp

And here is the error message:


"
Using GATK wrapper script /curr/tianj/software/gatk/build/install/gatk/bin/gatk
Running:
/curr/tianj/software/gatk/build/install/gatk/bin/gatk BaseRecalibratorSpark -I A15_markduplicatespark.bam -knownSites ref/Mills_and_1000G_gold_standard.indels.b37.vcf -O A15_baserecalibratespark.table -R /curr/tianj/data/humann_g1k_v37.2bit
17:19:00.338 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/curr/tianj/software/gatk/build/instabgkl_compression.so
[May 17, 2017 5:19:00 PM UTC] org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark --knownSites /genome/ref/db_1000G_gold_standard.indels.b37.vcf --output A15_baserecalibratespark.table --reference /curr/tianj/data/humann_g1k_v37.2bp --joinStrategy BROADCAST --mismatches_context_size 2 --indels_context_size 3 --maximum_cycle_value 500 --mismatches_defdeletions_default_quality 45 --low_quality_tail 2 --quantizing_levels 16 --bqsrBAQGapOpenPenalty 40.0 --preserve_qscores_lles false --useOriginalQualities false --defaultBaseQualities -1 --readShardSize 10000 --readShardPadding 1000 --readValid-interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --disableSequenceDictionaryValidation false --sharl[*] --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inf
[May 17, 2017 5:19:00 PM UTC] Executing as tianj@ip-172-31-78-66 on Linux 4.4.41-36.55.amzn1.x86_64 amd64; Java HotSpot(TM:4.alpha.2-261-gb8d32ee-SNAPSHOT
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.BUFFER_SIZE : 131072
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.COMPRESSION_LEVEL : 1
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CREATE_INDEX : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CREATE_MD5 : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CUSTOM_READER_FACTORY :
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.REFERENCE_FASTA : null
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
17:19:00.371 INFO BaseRecalibratorSpark - Deflater IntelDeflater
17:19:00.372 INFO BaseRecalibratorSpark - Inflater IntelInflater
17:19:00.372 INFO BaseRecalibratorSpark - Initializing engine
17:19:00.372 INFO BaseRecalibratorSpark - Done initializing engine
17:19:00.872 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java clas
17:22:09.153 INFO BaseRecalibratorSpark - Shutting down engine
[May 17, 2017 5:22:09 PM UTC] org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark done. Elapsed time: 3.15 min
Runtime.totalMemory()=15504244736
java.lang.ArrayIndexOutOfBoundsException: 1073741865
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:382)
at com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:65)
at com.esotericsoftware.kryo.Kryo.reset(Kryo.java:865)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:630)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:195)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1310)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:646)
at org.broadinstitute.hellbender.engine.spark.BroadcastJoinReadsWithVariants.join(BroadcastJoinReadsWithVariants.j
at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark.add(AddContextDataToReadSpark.java:67)
at org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark.runTool(BaseRecalibratorSpark.java:93)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:121)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:142)
at org.broadinstitute.hellbender.Main.main(Main.java:220)
"


Sorry, but I was wrong before: this error also appears with the small dataset.
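One clue for whoever picks this up: the failing index is just past 2^30, which looks more like an overflow of a power-of-two-sized table inside the serializer than anything specific to my data. A quick check (the guess about Kryo's internals is mine, not confirmed):

    $ printf '0x%X = 2^30 + %d\n' 1073741865 $(( 1073741865 - (1 << 30) ))
    0x40000029 = 2^30 + 41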


Answers

  • joe297 (America, Member)

    The run succeeds after I delete dbsnp_138.b37.vcf from the known sites. But can anyone tell me why this is happening?

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @joe297
    Hi,

    I just moved your question to the GATK4 category where someone will help you. Keep in mind these tools are still experimental, so we cannot provide full support yet.

    -Sheila

  • LouisB (Broad Institute; Member, Broadie, Dev)

    Hi @joe297.

    I haven't seen this exact issue before, but I suspect it is related to https://github.com/broadinstitute/gatk/issues/1524, a known issue with broadcasting large files using Spark. If it is a relative of that error, then the current best fix is probably to upgrade your cluster to kryo 4.0+, though that may be infeasible for you. The awkward workaround is to change the JVM's identity hashCode behavior by providing the additional Spark configuration options --conf spark.executor.extraJavaOptions=-XX:hashCode=0 and --conf spark.driver.extraJavaOptions=-XX:hashCode=0.
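    For concreteness, the full invocation might look like the sketch below. This assumes your GATK4 build forwards everything after the -- separator to the Spark launcher (paths and master URL are placeholders; if your wrapper doesn't accept these flags, the same two properties can be set in spark-defaults.conf instead):

        gatk BaseRecalibratorSpark \
            -I input_markduplicatespark.bam \
            -knownSites ref/dbsnp_138.b37.vcf \
            -O recal.table \
            -R ref/human_g1k_v37.2bit \
            -- \
            --sparkRunner SPARK --sparkMaster spark://host:7077 \
            --conf spark.executor.extraJavaOptions=-XX:hashCode=0 \
            --conf spark.driver.extraJavaOptions=-XX:hashCode=0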

    Would you mind filing this as an issue on the GATK4 GitHub tracker (https://github.com/broadinstitute/gatk/issues)?
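    In the report, it would help to note which kryo your build currently bundles; listing the install's lib directory is enough (the path below matches the wrapper shown in your log):

        ls /curr/tianj/software/gatk/build/install/gatk/lib/ | grep -i kryo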

    Thanks,
    Louis

  • joe297 (America, Member)

    Hi @LouisB

    Sorry, it's not because of the size of the data; it also happens with small files. I used both the dbsnp_138 and Mills gold standard databases as known sites in the original command when I got these errors. However, after I removed the dbsnp_138 database, the command worked.

  • cuimie (China, Member)

    Just to share my experience:

    I encountered similar errors while trying to run BQSRPipelineSpark or ReadsPipelineSpark with --knownSites=dbsnp147 (a 3.3 GB vcf.gz file). I then tried 1000G_phase1 (a 1.8 GB vcf.gz file), which also failed. I also tried @LouisB's suggestion of setting --conf "spark.executor.extraJavaOptions=-XX:hashCode=0" --conf "spark.driver.extraJavaOptions=-XX:hashCode=0", but it ran for 5 hours until I killed it manually; I think that is unreasonably long for ~50X human exome data.

    Finally I used hapmap_3.3.vcf (a 59 MB vcf.gz file), and it worked. Mills_and_1000G_gold_standard.indels (a 20 MB vcf.gz file) worked too.

    GATK version: 4.alpha.2-1125-g27b5190-SNAPSHOT, built with kryo 3 (I found the jar file build/install/gatk/lib/kryo-3.0.3.jar).
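    Given that only the smaller files work, one stopgap until the kryo upgrade lands might be to shrink the known-sites VCF to your capture intervals before BQSR, so the broadcast object stays small. A rough sketch, assuming your build includes SelectVariants and you have an exome target list (all paths are placeholders):

        gatk SelectVariants \
            -R ref/human_g1k_v37.fasta \
            -V ref/dbsnp_138.b37.vcf \
            -L exome_targets.interval_list \
            -O ref/dbsnp_138.exome_subset.vcf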
