
Can't run any GATK4 tools on GCS dataproc

dmccabe (Boston) Member, Broadie

I'm having a strange issue running GATK4 tools on a dataproc cluster. I'm submitting from a Broad VM with an empty bash profile. As an example, here's what happens when I try to reproduce this tutorial. I'm running these commands from inside my GATK repo, which is on the current master branch:

$ use .google-cloud-sdk-98.0.0
$ use Java-1.8
$ gsutil ls -lr gs://gatk-test-data/exome_bam/1000G_wex_hg38/HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam

This shows me the correct file size.

Then I spin up a dataproc cluster with image version 1.1, as instructed, and I'm able to ssh into the master node and confirm it's running Java 1.8 as well. For reference, the cluster was created with something along these lines (a reconstruction; the exact flags follow the tutorial):
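    $ gcloud dataproc clusters create cluster-8ed1 --image-version 1.1

The problem occurs when I try to run any command via gatk-launch: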

$ ./gatk-launch FlagStatSpark \
    -I gs://gatk-tutorials/how-to/6484_snippet.bam \
    --disableReadFilter WellformedReadFilter \
    -- --sparkRunner GCS --cluster cluster-8ed1

The output from this particular command shows the generated gcloud command and the error I get:

Using GATK jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar
jar caching is disabled because GATK_GCS_STAGING is not set

please set GATK_GCS_STAGING to a bucket you have write access too in order to enable jar caching
add the following line to you .bashrc or equivalent startup script

    export GATK_GCS_STAGING=gs://<my_bucket>/

Replacing spark-submit style args with dataproc style args

--cluster cluster-8ed1 -> --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600

Running:
    gcloud dataproc jobs submit spark --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600 --jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar -- FlagStatSpark -I gs://gatk-tutorials/how-to/6484_snippet.bam --disableReadFilter WellformedReadFilter --sparkMaster yarn
Copying file:///xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar [Content-Type=application/octet-stream]...
Uploading   ...827d4-e9bd-470f-b4e6-0b95e5dd676f/gatk-spark.jar: 124.84 MiB/124.84 MiB
Job [7d532aeb-6a3b-4e2e-8b43-187374e33104] submitted.
Waiting for job output...
USAGE:  <program name> [-h]

Available Programs:
--------------------------------------------------------------------------------------
<snip>
Exception in thread "main" org.broadinstitute.hellbender.exceptions.UserException: '--' is not a valid command.

This is the same error you'd get if you ran ./gatk-launch -- instead of an actual tool name. I get this error for any tool name and options I specify.
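For comparison, running the launcher locally with just the separator and no tool name produces the same failure:

    $ ./gatk-launch --
    USAGE:  <program name> [-h]

    Available Programs:
    <snip>
    Exception in thread "main" org.broadinstitute.hellbender.exceptions.UserException: '--' is not a valid command.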

I can see from the generated gcloud command above that the job does get sent to the cluster with -- FlagStatSpark as the first part of the argument list.

Why is this happening? Is there something wrong with the .google-cloud-sdk dotkit? Has something changed in GATK?


Linked GitHub issue #3649 (open), filed by Sheila.


Answers

  • dmccabe (Boston) Member, Broadie

    So, the problem is actually the outdated Cloud SDK loaded by Broad's .google-cloud-sdk dotkit. I installed the latest version (171.0.0) on the VM and everything works. It would be good to specify a minimum SDK version in the GATK readme, and maybe check it in the launcher, too.
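    A minimal sketch of what such a launcher check could look like, assuming a placeholder minimum version (the real cutoff isn't confirmed here):

        # Hypothetical guard for gatk-launch: refuse to submit with an old Cloud SDK.
        # MIN_SDK is an assumed placeholder; the true minimum version is unknown.
        MIN_SDK="135.0.0"
        SDK_VERSION=$(gcloud version 2>/dev/null | awk '/^Google Cloud SDK/ {print $4}')
        if [ -z "$SDK_VERSION" ]; then
            echo "WARNING: could not determine Google Cloud SDK version" >&2
        elif [ "$(printf '%s\n%s\n' "$MIN_SDK" "$SDK_VERSION" | sort -V | head -n 1)" != "$MIN_SDK" ]; then
            echo "ERROR: Google Cloud SDK $SDK_VERSION is older than the required $MIN_SDK" >&2
            exit 1
        fi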

    Linked GitHub issue #2505 (open), filed by Sheila.