Can't run any GATK4 tools on GCS dataproc

dmccabe · Boston · Member, Broadie

I'm having a strange issue running GATK4 tools on a dataproc cluster. I'm submitting from a Broad VM with an empty bash profile. As an example, here's what happens when I try to reproduce this tutorial. I'm running these commands from inside my GATK repo, which is at the current master branch:

$ use .google-cloud-sdk-98.0.0
$ use Java-1.8
$ gsutil ls -lr gs://gatk-test-data/exome_bam/1000G_wex_hg38/HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam

This shows me the correct file size.
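
For reference, the listing looks roughly like this (the size and timestamp below are illustrative placeholders, not the real values):

      11111111  2015-08-26T00:00:00Z  gs://gatk-test-data/exome_bam/1000G_wex_hg38/HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam
    TOTAL: 1 objects, 11111111 bytes (10.59 MiB)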

Then, I spin up a Dataproc cluster with image version 1.1 as instructed. I'm able to ssh into the master node and confirm that it's running Java 1.8 as well.
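
For reference, a cluster like that can be created with something along these lines (the zone and worker count below are placeholders, not necessarily the settings I used):

$ gcloud dataproc clusters create cluster-8ed1 \
    --image-version 1.1 \
    --zone us-central1-a \
    --num-workers 2

The problem occurs when I try to run any command via gatk-launch: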

$ ./gatk-launch FlagStatSpark \
    -I gs://gatk-tutorials/how-to/6484_snippet.bam \
    --disableReadFilter WellformedReadFilter \
    -- --sparkRunner GCS --cluster cluster-8ed1

The output from this particular command shows the generated gcloud command and the error I get:

Using GATK jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar
jar caching is disabled because GATK_GCS_STAGING is not set

please set GATK_GCS_STAGING to a bucket you have write access to in order to enable jar caching
add the following line to your .bashrc or equivalent startup script

    export GATK_GCS_STAGING=gs://<my_bucket>/

Replacing spark-submit style args with dataproc style args

--cluster cluster-8ed1 -> --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600

Running:
    gcloud dataproc jobs submit spark --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600 --jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar -- FlagStatSpark -I gs://gatk-tutorials/how-to/6484_snippet.bam --disableReadFilter WellformedReadFilter --sparkMaster yarn
Copying file:///xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar [Content-Type=application/octet-stream]...
Uploading   ...827d4-e9bd-470f-b4e6-0b95e5dd676f/gatk-spark.jar: 124.84 MiB/124.84 MiB
Job [7d532aeb-6a3b-4e2e-8b43-187374e33104] submitted.
Waiting for job output...
USAGE:  <program name> [-h]

Available Programs:
--------------------------------------------------------------------------------------
<snip>
Exception in thread "main" org.broadinstitute.hellbender.exceptions.UserException: '--' is not a valid command.

This is the same error you'd get if you ran ./gatk-launch -- instead of an actual tool name. I get this error for any tool name and options I specify.
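
For what it's worth, that's easy to confirm locally; running the launcher with a bare separator and no tool name produces the same usage dump and exception (output truncated):

    $ ./gatk-launch --
    USAGE:  <program name> [-h]
    <snip>
    Exception in thread "main" org.broadinstitute.hellbender.exceptions.UserException: '--' is not a valid command.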

I can see that the command does get sent to the cluster, with -- FlagStatSpark as the first part of the argument list.

Why is this happening? Is there something wrong with the GCS dotkit? Has something changed with the GATK?

GitHub issue #3649 (filed by Sheila, state: open)

Answers

  • dmccabe · Boston · Member, Broadie

    So, the problem is actually Broad's .google-cloud-sdk dotkit. I installed the latest version (171.0.0) on the VM and everything works. It would be good to specify a minimum version number in the GATK readme, and maybe check it in the launcher, too.
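
    For anyone else who hits this, the installed SDK version is easy to check, and a launcher-side guard could look something like the sketch below. Note that 140.0.0 is a made-up placeholder: I don't know the actual minimum version, only that 98.0.0 fails and 171.0.0 works.

        $ gcloud version
        Google Cloud SDK 171.0.0

        # Hypothetical version guard for gatk-launch; MIN_GCLOUD is a placeholder cutoff.
        MIN_GCLOUD="140.0.0"
        CURRENT_GCLOUD=$(gcloud version 2>/dev/null | awk '/^Google Cloud SDK/ {print $4}')
        # sort -V orders version strings; if the minimum doesn't sort first, the installed SDK is too old.
        if [ "$(printf '%s\n%s\n' "$MIN_GCLOUD" "$CURRENT_GCLOUD" | sort -V | head -n 1)" != "$MIN_GCLOUD" ]; then
            echo "gatk-launch requires Google Cloud SDK >= $MIN_GCLOUD (found ${CURRENT_GCLOUD:-none})" >&2
            exit 1
        fi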

    GitHub issue #2505 (filed by Sheila, state: open)