
Running Spark-enabled tools on a cluster

Hi, I'm having difficulty running the Spark tools on my Spark cluster. I think we have a version mismatch, but I can't determine where it is. I've tried both the HaplotypeCallerSpark and PrintReadsSpark tools; each works when run with the embedded Spark. Our cluster is running Spark 2.4.2 with Scala 2.12 and Hadoop 3.1.2.
My command lines are (respectively):
./gatk HaplotypeCallerSpark -R hdfs://NameNode-1:54310/gatk-tutorial/gatk-workflows/inputs/exome_Homo_sapiens_assembly18.fasta -I hdfs://NameNode-1:54310/gatk-tutorial/gatk-workflows/inputs/exome_NA12878.ga2.exome.maq.raw.bam -O hdfs://NameNode-1:54310/gatk-tutorial/outputs/HaplottypeCallerSpark_output.vcf -- --spark-runner SPARK --spark-master spark://master:7077 --num-executors 4 --executor-cores 4

./gatk PrintReadsSpark -I file:///tmp/NA12878_24RG_small.hg38.bam -O file:///tmp/OutputPrintReadsSpark --spark-verbosity DEBUG -- --spark-runner SPARK --spark-master spark://master:7077 --num-executors 4 --executor-cores 4

The error in both cases is the same:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.avro.Schema$Parser.parse(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/avro/Schema;
at org.bdgenomics.formats.avro.Variant.<clinit>(Variant.java:18)
at sun.misc.Unsafe.ensureClassInitialized(Native Method)
at sun.reflect.UnsafeFieldAccessorFactory.newFieldAccessor(UnsafeFieldAccessorFactory.java:43)
at sun.reflect.ReflectionFactory.newFieldAccessor(ReflectionFactory.java:156)
at java.lang.reflect.Field.acquireFieldAccessor(Field.java:1088)
at java.lang.reflect.Field.getFieldAccessor(Field.java:1069)
at java.lang.reflect.Field.get(Field.java:393)
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:205)
at org.apache.avro.specific.SpecificData.getSchema(SpecificData.java:154)
at org.apache.avro.specific.SpecificDatumReader.<init>(SpecificDatumReader.java:32)
at org.bdgenomics.adam.serialization.AvroSerializer.<init>(ADAMKryoRegistrator.scala:43)
at org.bdgenomics.adam.models.VariantContextSerializer.<init>(VariantContext.scala:94)
at org.bdgenomics.adam.serialization.ADAMKryoRegistrator.registerClasses(ADAMKryoRegistrator.scala:194)
at org.broadinstitute.hellbender.engine.spark.GATKRegistrator.registerClasses(GATKRegistrator.java:104)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$7(KryoSerializer.scala:136)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$7$adapted(KryoSerializer.scala:136)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:136)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:324)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:309)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:218)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:288)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
at org.apache.spark.rdd.NewHadoopRDD.<init>(NewHadoopRDD.scala:79)
at org.apache.spark.SparkContext.$anonfun$newAPIHadoopFile$2(SparkContext.scala:1160)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:699)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1146)
at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:478)
at org.disq_bio.disq.impl.file.PathSplitSource.getPathSplits(PathSplitSource.java:96)
at org.disq_bio.disq.impl.formats.bgzf.BgzfBlockSource.getBgzfBlocks(BgzfBlockSource.java:66)
at org.disq_bio.disq.impl.formats.bam.BamSource.getPathChunks(BamSource.java:125)
at org.disq_bio.disq.impl.formats.sam.AbstractBinarySamSource.getReads(AbstractBinarySamSource.java:86)
at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:162)
at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:123)
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:185)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:562)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:541)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:531)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Am I issuing the commands incorrectly, or is there an issue with versions? If it's versions, what needs to change?

Answers

  • dcrespi (Member)
    It appears that gatk4 is having some sort of conflict with another application I have. In my spark-defaults.conf file, I include the options spark.driver.extraClassPath and spark.executor.extraClassPath, and when the path includes a wildcard, it fails with the error above.

    Example:
    spark.driver.extraClassPath /Path/To/Jars/*:.
    spark.executor.extraClassPath /Path/To/Jars/*:.

    Experimenting around with it... it doesn't matter what the path is, even an empty directory; as soon as the "*" is added, gatk4 fails.
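
    For comparison, a minimal sketch of the same settings without the wildcard (the jar names here are just placeholders):

    spark.driver.extraClassPath /Path/To/Jars/some-dependency.jar:/Path/To/Jars/another-dependency.jar:.
    spark.executor.extraClassPath /Path/To/Jars/some-dependency.jar:/Path/To/Jars/another-dependency.jar:.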

    Any thoughts about that?

    Thanks,
    david
  • bhanuGandham (Cambridge MA), Member, Administrator, Broadie, Moderator

    Hi @dcrespi

    Can you please explain what you mean by "gatk4 is having some sort of conflict with another application"? Which other application are you using, and how are you using it with gatk?

  • dcrespi (Member)
    If you read my comment above, it doesn't matter what other application is there. When I have the wildcard (*) in my spark-defaults.conf file, gatk fails... even if it's pointing to an empty directory. I think there is something going on with how the classpath is interpreted. I'm still working through this, but from what I've seen so far, it looks like the gatk spark jar isn't provided once the * is given; at least I don't see it in the logs any longer. If the wildcard isn't provided, then I do see it included in the logs.
  • dcrespi (Member)
    So I take back what I just said. I just created another empty directory (/Foo/jars), made it look like the one in my spark-defaults.conf file (same permissions and owner), and it worked. If I blow away the original path that was in my spark-defaults.conf file and just re-add that path, it fails again. Is there a cache somewhere that is remembering the contents of that path?
  • bhanuGandham (Cambridge MA), Member, Administrator, Broadie, Moderator

    @dcrespi

    I am not sure about caching in Spark. This question might be better suited for Biostars or SEQanswers.

  • dcrespi (Member)
    I wasn't referring to Spark. Would you happen to know if gatk4 has been built and tested with Scala 2.12? I've rebuilt it and updated it as best I could, but I'm wondering if that's where my issues are coming from. My Spark and Hadoop are both on Scala 2.12.
  • bhanuGandham (Cambridge MA), Member, Administrator, Broadie, Moderator

    @dcrespi

    I checked with the dev team, and they suggested that if you want to build a gatk jar that works with Scala 2.12, you have to set the environment variable SCALA_VERSION=2.12 and build it yourself. Can you try that and let me know how it works out for you?

  • dcrespi (Member)
    Well, that's exactly what I did. It appears that there are other things in the build file that also rely on the Scala version, but these aren't changed by the SCALA_VERSION variable. That's why I was wondering if there was an updated build based on Scala 2.12.
  • LouisB (Broad Institute), Member, Broadie, Dev

    Hi @dcrespi,

    I'm sorry you're having trouble with this. We currently run a subset of our tests using a combination of Java 11 and Scala 2.12. Everything passes in our test system, but it covers a very limited set of environments and often doesn't find issues that you encounter on a real cluster. I think there are currently some packaging errors when building the sparkJar because of explicit excludes of the Scala 2.11 Spark modules but not the 2.12 ones. I'm trying to figure out a good build for the sparkJar but hitting a few issues. I'll let you know when I have a better solution for you to try.

  • LouisB (Broad Institute), Member, Broadie, Dev

    So I have 2 things to try.

    1. I gave @bhanuGandham bad advice when she asked me how to build for 2.12. SCALA_VERSION=2.12 is what we set in our test environment, but it doesn't get picked up automatically; you have to set the system property -Dscala.version=2.12 when running Gradle.

      i.e. ./gradlew -Dscala.version=2.12 sparkJar

    2. I think we're currently accidentally packaging a copy of Spark into the sparkJar when you build with 2.12. This causes the sort of errors you're seeing. I have a branch, lb_fix_spark_packaging, which I think may fix the problem.

    So I recommend you first make sure you were actually building for 2.12 correctly, and then try that custom branch. Let me know if it works for you.
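
    Roughly, the full sequence I have in mind is the following (a sketch; the paths and jar name may differ in your checkout, and GATK_SPARK_JAR assumes you launch through the ./gatk wrapper script):

      # fetch and build the experimental branch for Scala 2.12
      git fetch origin lb_fix_spark_packaging
      git checkout lb_fix_spark_packaging
      ./gradlew -Dscala.version=2.12 sparkJar
      # point the ./gatk wrapper at the freshly built Spark jar (assumes it lands in build/libs)
      export GATK_SPARK_JAR=$(ls build/libs/*-spark.jar)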

    Scala 2.12 support should be considered beta at best, so you may encounter additional issues. Please report any that you do encounter.

  • dcrespi (Member)
    Neither of the options worked.
    Option 1:
    Using build 4.1.4.0-17-ge565ab217 with the build option you provided and the stock build.gradle file causes this error when running with --spark-runner LOCAL: Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class

    Option 2:
    Using the new repo clone with the stock build.gradle file runs with --spark-runner LOCAL, but fails with --spark-runner SPARK with the error: Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class

    I assume that's because my Spark cluster is using Scala 2.12. I also tried building the new repo with the same gradlew option, but I still get the error.

    If I take the original repo and change the build.gradle as follows (build.gradle_new), then I can run with --spark-runner SPARK and use my cluster, but running with --spark-runner LOCAL fails with: Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class

    Diff of the two files (_org=original build file; _new=modified)
    /tmp/gatk$ diff build.gradle_org build.gradle_new
    63,65c63,65
    < final sparkVersion = System.getProperty('spark.version', '2.4.3')
    < final scalaVersion = System.getProperty('scala.version', '2.11')
    < final hadoopVersion = System.getProperty('hadoop.version', '2.8.2')
    ---
    > final sparkVersion = System.getProperty('spark.version', '2.4.2')
    > final scalaVersion = System.getProperty('scala.version', '2.12')
    > final hadoopVersion = System.getProperty('hadoop.version', '3.1.2')
    217c217
    < exclude module: 'spark-core_2.11'
    ---
    > exclude module: 'spark-core_2.12'
    223c223
    < exclude module: 'spark-mllib_2.11'
    ---
    > exclude module: 'spark-mllib_2.12'
    247,248c247,248
    < exclude module: 'spark-core_2.11'
    < exclude module: 'spark-sql_2.11'
    ---
    > exclude module: 'spark-core_2.12'
    > exclude module: 'spark-sql_2.12'
    299,300c299,300
    < compile 'org.bdgenomics.bdg-formats:bdg-formats:0.5.0'
    < compile('org.bdgenomics.adam:adam-core-spark2_' + scalaVersion + ':0.28.0') {
    ---
    > compile 'org.bdgenomics.bdg-formats:bdg-formats:0.14.0'
    > compile('org.bdgenomics.adam:adam-core-spark2_' + scalaVersion + ':0.29.0') {

    So I can at least run on my Spark cluster with the changes to the build file shown above.

    Going back to the original problem I posted: for gatk to run, I have to remove avro-1.7.4.jar, which belongs to a different application, from my classpath. gatk wants org.bdgenomics.formats.avro.Variant, which is pulled in by one of its dependencies and itself depends on an avro-X.jar; my application's version is avro-1.7.4.jar. So if I put gatk first on the classpath, it loads the org.bdgenomics.formats Avro and not my app's version, and then my app fails. And if I put gatk second, it loads my app's version first, and then gatk fails with the unknown method call.
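
    For what it's worth, a check along these lines shows which jars in a classpath directory bundle the conflicting Avro class (a rough sketch; the directory is a placeholder):

        # list every jar that ships org.apache.avro.Schema, the class behind the NoSuchMethodError
        for j in /Path/To/Jars/*.jar; do
          if unzip -l "$j" 2>/dev/null | grep -q 'org/apache/avro/Schema.class'; then
            echo "$j"
          fi
        done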

    So how do I get around this???
  • dcrespi (Member)
    Quick update... I've now rebuilt my entire environment to use Scala 2.11 throughout. I'm still getting the exact same error (Exception in thread "main" java.lang.NoSuchMethodError: org.apache.avro.Schema$Parser.parse(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/avro/Schema;
    at org.bdgenomics.formats.avro.Variant.<clinit>(Variant.java:18))
    unless I comment my app out of the classpath.
  • LouisB (Broad Institute), Member, Broadie, Dev

    I'm not sure exactly what's going on. Are you running your app alongside GATK as part of a single Spark pipeline?

    ADAM is a very minimally used dependency in GATK unless you are loading ADAM files. You could try building a version of GATK that doesn't register ADAM classes on startup. I put up a branch here that should log an error instead of falling over if ADAM registration fails, although I'm not sure it will definitely fix your problems.

    If you're running with Hadoop 3, I've seen some other issues if you don't specify a newer Jackson library. Adding this explicit dependency to your build.gradle could help.

        compile('com.fasterxml.jackson.module:jackson-module-scala_' + scalaVersion + ':2.9.8')
    
  • dcrespi (Member)
    Well, more digging. I found that my builds of Spark and Hadoop have the Avro 1.7.4 jar bundled in them. I switched to builds where Avro 1.7.7 was included, and that got rid of this error!
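
    For anyone hitting the same thing, a check along these lines shows which Avro jar a Spark or Hadoop distribution bundles (a sketch; it assumes the usual $SPARK_HOME/jars and Hadoop share/ layouts):

        # Avro jar shipped with the Spark distribution
        ls "$SPARK_HOME"/jars/avro-*.jar
        # Avro jars shipped with the Hadoop distribution
        find "$HADOOP_HOME"/share/hadoop -name 'avro-*.jar'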