Can't run CreateReadCountPanelOfNormals in cluster mode 4.1.0.0

obigbando (Taiwan, Taipei), Member
edited May 9 in Ask the GATK team

hi,

We found that the tool CreateReadCountPanelOfNormals, which doesn't have a Spark suffix, is actually a Spark job that runs with a Spark master setting of local[*]. The tool works perfectly when we run it with the default local[*] setting, using this sample command:

$ java -jar /root/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateReadCountPanelOfNormals \
    --arguments_file /tmp/CreateReadCountPanelOfNormals/50-input.list \
    --annotated-intervals /tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
    -O /tmp/CreateReadCountPanelOfNormals/50-0.hdf5

But we can't run it with the --spark-master option pointed at our cluster. Can CreateReadCountPanelOfNormals be run as a normal Spark job? If so, how should we run it? Thanks.
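
For reference, this is roughly how we would expect to hand a GATK Spark-capable tool to the cluster via the gatk wrapper script, which forwards everything after the bare -- separator (e.g. --spark-runner SPARK and --spark-master) to spark-submit. This is only a sketch, assuming the gatk wrapper that ships next to the jars sits at /root/gatk-4.1.0.0/gatk; whether CreateReadCountPanelOfNormals actually honors a non-local master this way is exactly what we are unsure about:

$ /root/gatk-4.1.0.0/gatk CreateReadCountPanelOfNormals \
    --arguments_file /tmp/CreateReadCountPanelOfNormals/50-input.list \
    --annotated-intervals /tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
    -O /tmp/CreateReadCountPanelOfNormals/50-0.hdf5 \
    -- \
    --spark-runner SPARK \
    --spark-master spark://head:7077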

=============================================

What we've tried:

1) Using the java -jar command:

$ java -jar /root/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateReadCountPanelOfNormals \
    --arguments_file /tmp/CreateReadCountPanelOfNormals/50-input.list \
    --annotated-intervals /tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
    -O /tmp/CreateReadCountPanelOfNormals/50-0.hdf5  \
    --spark-master spark://head:7077

with the following error:
java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2783)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1605)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

2) Using the spark-submit command:
It seems the tool doesn't recognize HDFS paths, but it doesn't make sense to pass local file paths to a Spark job running in cluster mode. We tried both gatk-package-4.1.0.0-spark.jar and gatk-package-4.1.0.0-local.jar; both ended with the same file-not-found error.

/usr/local/spark//bin/spark-submit --master spark://head:7077 \
     --conf spark.driver.userClassPathFirst=false \
     --conf spark.io.compression.codec=lzf \
     --conf spark.driver.maxResultSize=0 \
     --conf "spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2"  \
     --conf "spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2"  \
     --conf spark.kryoserializer.buffer.max=512m \
     --conf spark.yarn.executor.memoryOverhead=600 /usr/local/seqslab/gatk-4.1.0.0/gatk-package-4.1.0.0-spark.jar \
     CreateReadCountPanelOfNormals \
    --arguments_file hdfs://tmp/CreateReadCountPanelOfNormals/50-input.list \
    --annotated-intervals hdfs://tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
    -O hdfs://tmp/CreateReadCountPanelOfNormals/50-0.hdf5

with the following error:
Caused by: java.io.FileNotFoundException: hdfs:/tmp/CreateReadCountPanelOfNormals/50-input.list (No such file or directory)
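
One workaround we have not tried yet is staging the inputs out of HDFS onto the driver's local disk before launching, so that arguments_file expansion only ever sees local paths. A rough sketch, assuming the files live under /tmp/CreateReadCountPanelOfNormals/ on the cluster's default file system (the per-sample read-count files listed inside 50-input.list would presumably need to be copied and re-pathed as well):

$ mkdir -p /tmp/CreateReadCountPanelOfNormals
# copy the arguments file and annotated intervals from HDFS to local disk
$ hdfs dfs -get /tmp/CreateReadCountPanelOfNormals/50-input.list /tmp/CreateReadCountPanelOfNormals/
$ hdfs dfs -get /tmp/CreateReadCountPanelOfNormals/50-annotate.tsv /tmp/CreateReadCountPanelOfNormals/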

Best Answer

Answers

  • bshifaw (Member, Broadie, Moderator, admin)

    Hi @obigbando ,

    Please pass along the full stack trace from the submitted job; this will help us get a clearer picture of the problem. You should be able to attach it as a file in your next post.

    You can try rerunning your second command with the 50-input.list file on a regular file system instead of an HDFS file system.

  • cnorman (United States), Member, Broadie, Dev ✭✭

    Yes, it looks like this is a limitation in the implementation of arguments_file expansion, which can't handle a file name for a file on a file system like HDFS. The corresponding ticket is here.

  • obigbando (Taiwan, Taipei), Member

    @bshifaw

    Following your suggestion, I tried the second command with all input files on the local file system, and the command succeeded. But when I checked our Spark master, I found that no job had actually been submitted there; the job seemed to run in local mode even though the --master argument was set. The same behavior also occurred when we ran the exact same command with gatk-package-4.1.0.0-local.jar.

     /usr/local/spark//bin/spark-submit \
        --master spark://head:7077 \
        --conf spark.driver.userClassPathFirst=false \
        --conf spark.io.compression.codec=lzf \
        --conf spark.driver.maxResultSize=0 \
        --conf "spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2" \
        --conf "spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2" \
        --conf spark.kryoserializer.buffer.max=512m \
        --conf spark.yarn.executor.memoryOverhead=600 /usr/local/seqslab/gatk-4.1.0.0/gatk-package-4.1.0.0-spark.jar \
        CreateReadCountPanelOfNormals \
        --arguments_file /seqslab/tmp/20b58d1e-2b71-4f89-8093-d424603421e3/CreateReadCountPanelOfNormals/50-input.list \
        --annotated-intervals /seqslab/tmp/20b58d1e-2b71-4f89-8093-d424603421e3/CreateReadCountPanelOfNormals/50-annotate.tsv \
        -O /seqslab/tmp/20b58d1e-2b71-4f89-8093-d424603421e3/CreateReadCountPanelOfNormals/50-0.hdf5
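
    For anyone reproducing this: one way we checked is to poll the standalone master's JSON status endpoint while the command runs (this assumes the default master web UI port of 8080) and look at the application lists it reports; nothing ever appeared there for us.

        # the returned JSON lists the applications currently registered with the master
        $ curl -s http://head:8080/json/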
    

    So it seems to me that CreateReadCountPanelOfNormals is a local tool implemented on top of the Spark framework: we should simply run it as a local command and not try to run it as a regular Spark job.

    @cnorman If that is the case, there would be no need for arguments_file to take an HDFS file.
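
    For the record, this is the minimal form we plan to settle on: run it as a plain local command and, if we need to cap how many threads the embedded Spark uses on the node, do that through --spark-master rather than pointing it at the cluster. A sketch only; local[8] is just an example value:

        # run locally, limiting the embedded Spark master to 8 threads
        $ java -jar /root/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateReadCountPanelOfNormals \
            --arguments_file /tmp/CreateReadCountPanelOfNormals/50-input.list \
            --annotated-intervals /tmp/CreateReadCountPanelOfNormals/50-annotate.tsv \
            -O /tmp/CreateReadCountPanelOfNormals/50-0.hdf5 \
            --spark-master 'local[8]'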

  • obigbando (Taiwan, Taipei), Member

    @slee Got it, thanks for the help. Thanks also to @bshifaw and @cnorman.

    What we are doing is running GATK4 applications on Spark. For a local tool like CreateReadCountPanelOfNormals, we will use the framework shown in deepvariant-on-spark, in which BAM files are transformed into Parquet files on HDFS and then partitioned and dispatched to multiple DeepVariant processes under the Spark framework.

    https://github.com/atgenomix/deepvariant-on-spark

    In the case of CreateReadCountPanelOfNormals, we would end up running a local[*] Spark application inside each partition of a cluster-mode Spark job. That is kind of awkward, which is why we tried to run it as a regular Spark job in the first place.
