
ArrayIndexOutOfBoundsException error in BaseRecalibratorSpark

joe297 (America, Member)
edited May 2017 in GATK 4 Beta

Hi,
Thank you for your time.
I ran BaseRecalibratorSpark with both GATK4 and GATK4-protected on an Amazon instance, and both gave me the error java.lang.ArrayIndexOutOfBoundsException: 1073741865. I didn't get this error when running a small dataset, but it appears when I use the real dataset.

The command I used is gatk4 BaseRecalibratorSpark -I xx_markduplicatespark.bam -knownSites /genome/ref/dbsnp_138.b37.vcf -knownSites /genome/ref/Mills_and_1000G_gold_standard.indels.b37.vcf -O xx_baserecalibratespark.table -R /curr/tianj/data/humann_g1k_v37.2bit --TMP_DIR tmp

And here is the error message,


"
Using GATK wrapper script /curr/tianj/software/gatk/build/install/gatk/bin/gatk
Running:
/curr/tianj/software/gatk/build/install/gatk/bin/gatk BaseRecalibratorSpark -I A15_markduplicatespark.bam -knownSites ref/Mills_and_1000G_gold_standard.indels.b37.vcf -O A15_baserecalibratespark.table -R /curr/tianj/data/humann_g1k_v37.2bit
17:19:00.338 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/curr/tianj/software/gatk/build/instabgkl_compression.so
[May 17, 2017 5:19:00 PM UTC] org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark --knownSites /genome/ref/db_1000G_gold_standard.indels.b37.vcf --output A15_baserecalibratespark.table --reference /curr/tianj/data/humann_g1k_v37.2bp --joinStrategy BROADCAST --mismatches_context_size 2 --indels_context_size 3 --maximum_cycle_value 500 --mismatches_defdeletions_default_quality 45 --low_quality_tail 2 --quantizing_levels 16 --bqsrBAQGapOpenPenalty 40.0 --preserve_qscores_lles false --useOriginalQualities false --defaultBaseQualities -1 --readShardSize 10000 --readShardPadding 1000 --readValid-interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --disableSequenceDictionaryValidation false --sharl[*] --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inf
[May 17, 2017 5:19:00 PM UTC] Executing as [email protected] on Linux 4.4.41-36.55.amzn1.x86_64 amd64; Java HotSpot(TM:4.alpha.2-261-gb8d32ee-SNAPSHOT
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.BUFFER_SIZE : 131072
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.COMPRESSION_LEVEL : 1
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CREATE_INDEX : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CREATE_MD5 : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.CUSTOM_READER_FACTORY :
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.REFERENCE_FASTA : null
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:19:00.371 INFO BaseRecalibratorSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
17:19:00.371 INFO BaseRecalibratorSpark - Deflater IntelDeflater
17:19:00.372 INFO BaseRecalibratorSpark - Inflater IntelInflater
17:19:00.372 INFO BaseRecalibratorSpark - Initializing engine
17:19:00.372 INFO BaseRecalibratorSpark - Done initializing engine
17:19:00.872 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java clas
17:22:09.153 INFO BaseRecalibratorSpark - Shutting down engine
[May 17, 2017 5:22:09 PM UTC] org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark done. Elapsed time: 3.15 min
Runtime.totalMemory()=15504244736
java.lang.ArrayIndexOutOfBoundsException: 1073741865
at com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:382)
at com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:65)
at com.esotericsoftware.kryo.Kryo.reset(Kryo.java:865)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:630)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:195)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:236)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1310)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:646)
at org.broadinstitute.hellbender.engine.spark.BroadcastJoinReadsWithVariants.join(BroadcastJoinReadsWithVariants.j
at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark.add(AddContextDataToReadSpark.java:67)
at org.broadinstitute.hellbender.tools.spark.BaseRecalibratorSpark.runTool(BaseRecalibratorSpark.java:93)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:121)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:142)
at org.broadinstitute.hellbender.Main.main(Main.java:220)
"


Sorry, I was wrong earlier: this error also appears with the small dataset.


Answers

  • joe297 (America, Member)

    It runs after I delete dbsnp_138.b37.vcf. But can anyone tell me why this is happening?

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @joe297
    Hi,

    I just moved your question to the GATK4 category where someone will help you. Keep in mind these tools are still experimental, so we cannot provide full support yet.

    -Sheila

  • LouisB (Broad Institute; Member, Broadie, Dev) ✭✭

    Hi @joe297.

    I haven't seen this exact issue before, but I suspect it is related to https://github.com/broadinstitute/gatk/issues/1524, which is a known issue with broadcasting large files using Spark. If it is a relative of that error, then the current best workaround is probably to upgrade your cluster to Kryo 4.0+. That may be infeasible for you, though. The awkward workaround is to change the behavior of hashCode by providing the additional Spark configuration options --conf spark.executor.extraJavaOptions=-XX:hashCode=0 and --conf spark.driver.extraJavaOptions=-XX:hashCode=0.
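
    For reference, here is roughly what the full invocation would look like with those options added to your original command. This is only a sketch: I'm assuming your GATK build passes --conf values straight through to Spark, which can vary between versions.

        gatk4 BaseRecalibratorSpark \
            -I xx_markduplicatespark.bam \
            -knownSites /genome/ref/dbsnp_138.b37.vcf \
            -knownSites /genome/ref/Mills_and_1000G_gold_standard.indels.b37.vcf \
            -O xx_baserecalibratespark.table \
            -R /curr/tianj/data/humann_g1k_v37.2bit \
            --conf spark.executor.extraJavaOptions=-XX:hashCode=0 \
            --conf spark.driver.extraJavaOptions=-XX:hashCode=0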

    Would you mind filing this as an issue on the GATK4 GitHub tracker (https://github.com/broadinstitute/gatk/issues)?

    Thanks,
    Louis

  • joe297 (America, Member)

    Hi @LouisB

    Sorry, it's not because of the size of the data; it also happens when I use small files. I used both the dbsnp138 and Mills gold standard databases in the original command when I got these errors. However, after I removed the dbsnp138 database, the command works.
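
    In other words, the same invocation as before works once only the Mills known-sites file is left (a sketch of the reduced command, minus dbsnp138):

        gatk4 BaseRecalibratorSpark \
            -I xx_markduplicatespark.bam \
            -knownSites /genome/ref/Mills_and_1000G_gold_standard.indels.b37.vcf \
            -O xx_baserecalibratespark.table \
            -R /curr/tianj/data/humann_g1k_v37.2bit \
            --TMP_DIR tmp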

  • cuimie (China, Member)

    Just sharing my experience:

    I encountered similar errors while trying to run BQSRPipelineSpark or ReadsPipelineSpark with --knownSites=dbsnp147 (a 3.3 GB vcf.gz file). Then I tried 1000G_phase1 (a 1.8 GB vcf.gz file), which also failed. I also tried @LouisB's suggestion of setting --conf "spark.executor.extraJavaOptions=-XX:hashCode=0" --conf "spark.driver.extraJavaOptions=-XX:hashCode=0", but it ran for 5 hours until I killed it manually; I think that is unreasonably long for ~50X human exome data.

    Finally, I used hapmap_3.3.vcf (a 59 MB vcf.gz file) and it worked. Mills_and_1000G_gold_standard.indels (a 20 MB vcf.gz file) worked too.

    GATK version: 4.alpha.2-1125-g27b5190-SNAPSHOT, built with Kryo 3 (I found the jar file build/install/gatk/lib/kryo-3.0.3.jar).
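
    If you want to check which Kryo your own build ships with, one quick way (assuming a local build tree laid out like mine) is to list the installed jars:

        # look for the bundled Kryo jar; Kryo 4.0+ reportedly fixes this overflow
        ls build/install/gatk/lib/ | grep -i kryo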
