
Can't run CreateReadCountPanelOfNormals in cluster mode

obigbando · Taiwan, Taipei · Member
edited May 2019 in Ask the GATK team


We found that the tool CreateReadCountPanelOfNormals, which doesn't have a Spark suffix, is actually a Spark job running with a Spark master setting of local[*]. The tool works perfectly when we run it using the default local[*] setting, with this sample command:

$ java -jar /root/gatk- CreateReadCountPanelOfNormals \
    --arguments_file /tmp/CreateReadCountPanelOfNormals/50-input.list \
    --annotated-intervals /tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
    -O /tmp/CreateReadCountPanelOfNormals/50-0.hdf5

but we can't run it with the --spark-master option. We're wondering whether we can run CreateReadCountPanelOfNormals as a normal Spark job. If so, how should we run it? Thanks.


What we've tried:

1) Using the java -jar command:

$ java -jar /root/gatk- CreateReadCountPanelOfNormals \
    --arguments_file /tmp/CreateReadCountPanelOfNormals/50-input.list \
    --annotated-intervals /tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
    -O /tmp/CreateReadCountPanelOfNormals/50-0.hdf5  \
    --spark-master spark://head:7077

which fails with the following error:
java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2783)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1605)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

2) Using the spark-submit command:
It seems the tool doesn't recognize HDFS paths, while it doesn't make sense to pass local file paths to a Spark job running in cluster mode. We tried both gatk-package- and gatk-package-; both ended with the same file-not-found error.

/usr/local/spark//bin/spark-submit --master spark://head:7077 \
     --conf spark.driver.userClassPathFirst=false \
     --conf spark.io.compression.codec=lzf \
     --conf spark.driver.maxResultSize=0 \
     --conf "spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2"  \
     --conf "spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2"  \
     --conf spark.kryoserializer.buffer.max=512m \
     --conf spark.yarn.executor.memoryOverhead=600 /usr/local/seqslab/gatk- \
     CreateReadCountPanelOfNormals \
    --arguments_file hdfs://tmp/CreateReadCountPanelOfNormals/50-input.list \
    --annotated-intervals hdfs://tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
    -O hdfs://tmp/CreateReadCountPanelOfNormals/50-0.hdf5

with the following error:
Caused by: java.io.FileNotFoundException: hdfs:/tmp/CreateReadCountPanelOfNormals/50-input.list (No such file or directory)
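(For context, a stdlib-only sketch, not GATK code: the single slash in `hdfs:/tmp/...` in the error suggests the path is being opened with plain `java.io.File`, which knows nothing about HDFS and collapses the double slash. Separately, in a real `hdfs://` URI the component after `//` is the namenode authority, not a directory, so `hdfs://tmp/...` would name a host called `tmp`.)

```java
import java.io.File;
import java.net.URI;

public class HdfsPathDemo {
    public static void main(String[] args) {
        // java.io.File treats the URI string as a local path and collapses
        // the "//", reproducing the "hdfs:/tmp/..." form in the error above
        File f = new File("hdfs://tmp/CreateReadCountPanelOfNormals/50-input.list");
        System.out.println(f.getPath());   // hdfs:/tmp/CreateReadCountPanelOfNormals/50-input.list
        System.out.println(f.exists());    // false -- no such local file

        // In a proper hdfs:// URI, the part after "//" is the namenode
        // authority (host[:port]), not the first directory component
        URI u = URI.create("hdfs://tmp/CreateReadCountPanelOfNormals/50-input.list");
        System.out.println(u.getAuthority()); // tmp  (parsed as a host name)
        System.out.println(u.getPath());      // /CreateReadCountPanelOfNormals/50-input.list
    }
}
```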

Best Answer


  • bshifaw · Member, Broadie, Moderator admin

    Hi @obigbando ,

    Please pass along the full stack trace from the submitted job; this will help us get a clearer picture of the problem. You should be able to attach it as a file in your next post.

    You can try rerunning your second command with the 50-input.list file on a regular file system instead of an HDFS file system.

  • cnorman · United States · Member, Broadie, Dev ✭✭

    Yes, it looks like this is a limitation in the implementation of arguments_file expansion, which can't handle a file name for a file on a distributed file system like HDFS. The corresponding ticket is here

  • obigbando · Taiwan, Taipei · Member


    Following your suggestion, I tried the 2nd command with all input files on the local file system, and the command succeeded. But I checked our Spark master and found that no job was actually submitted there; the job seemed to run in local mode, even though the --master argument had been set. The same behavior occurred when we ran the exact same command with gatk-package-

        --master spark://head:7077
        --conf spark.driver.userClassPathFirst=false
        --conf spark.io.compression.codec=lzf
        --conf spark.driver.maxResultSize=0
        --conf "spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2"
        --conf "spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2"
        --conf spark.kryoserializer.buffer.max=512m
        --conf spark.yarn.executor.memoryOverhead=600 /usr/local/seqslab/gatk-
        --arguments_file /seqslab/tmp/20b58d1e-2b71-4f89-8093-d424603421e3/CreateReadCountPanelOfNormals/50-input.list
        --annotated-intervals /seqslab/tmp/20b58d1e-2b71-4f89-8093-d424603421e3/CreateReadCountPanelOfNormals/50-annotate.tsv
    -O /seqslab/tmp/20b58d1e-2b71-4f89-8093-d424603421e3/CreateReadCountPanelOfNormals/50-0.hdf5

    So it seems to me that CreateReadCountPanelOfNormals is a local tool implemented with the Spark framework, and we should simply run it as a local command rather than trying to run it as a regular Spark job.

    @cnorman If that is the case, there would be no need for arguments_file to take an HDFS file.

  • obigbando · Taiwan, Taipei · Member

    @slee Got it, thanks for the help. Thanks also to @bshifaw and @cnorman.

    What we are doing is running GATK4 applications on Spark. For a local tool like CreateReadCountPanelOfNormals, we will use our framework shown in deepvariant-on-spark, in which BAM files are transformed into Parquet files on HDFS and then partitioned and dispatched to multiple DeepVariant processes under the Spark framework.


    In the case of CreateReadCountPanelOfNormals, we would end up running a local[*] Spark application inside each partition of a cluster-mode Spark job. That is kind of weird, and that is why we tried to run it as a regular Spark job in the first place, to avoid this situation.
