
Can't run any GATK4 tools on GCS dataproc

dmccabe (Boston), Member, Broadie

I'm having a strange issue running GATK4 tools on a dataproc cluster. I'm submitting from a Broad VM with an empty bash profile. As an example, here's what happens when I try to reproduce this tutorial. I'm running these commands from inside my GATK repo, which is at the current master branch:

$ use .google-cloud-sdk-98.0.0
$ use Java-1.8
$ gsutil ls -lr gs://gatk-test-data/exome_bam/1000G_wex_hg38/HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam

This shows me the correct file size.
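(For reference, the cluster spin-up in the next step would be something along the lines of the sketch below; the zone and machine types here are placeholders I've filled in for illustration, not values taken from this thread:)

```shell
# Hypothetical spin-up command for the dataproc cluster used below.
# The zone and machine types are assumptions, not what was actually used.
gcloud dataproc clusters create cluster-8ed1 \
    --image-version 1.1 \
    --zone us-central1-a \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4 \
    --num-workers 2
```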

Then, I spin up a dataproc cluster with image version 1.1, as instructed. I'm able to ssh into the master node and confirm that it's running Java 1.8 as well. The problem occurs when I try to run any command via gatk-launch:

$ ./gatk-launch FlagStatSpark \
    -I gs://gatk-tutorials/how-to/6484_snippet.bam \
    --disableReadFilter WellformedReadFilter \
    -- --sparkRunner GCS --cluster cluster-8ed1

The output from this particular command shows the generated gcloud command and the error I get:

Using GATK jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar
jar caching is disabled because GATK_GCS_STAGING is not set

please set GATK_GCS_STAGING to a bucket you have write access too in order to enable jar caching
add the following line to you .bashrc or equivalent startup script

    export GATK_GCS_STAGING=gs://<my_bucket>/

Replacing spark-submit style args with dataproc style args

--cluster cluster-8ed1 -> --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600

    gcloud dataproc jobs submit spark --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600 --jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar -- FlagStatSpark -I gs://gatk-tutorials/how-to/6484_snippet.bam --disableReadFilter WellformedReadFilter --sparkMaster yarn
Copying file:///xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar [Content-Type=application/octet-stream]...
Uploading   ...827d4-e9bd-470f-b4e6-0b95e5dd676f/gatk-spark.jar: 124.84 MiB/124.84 MiB
Job [7d532aeb-6a3b-4e2e-8b43-187374e33104] submitted.
Waiting for job output...
USAGE:  <program name> [-h]

Available Programs:
Exception in thread "main" org.broadinstitute.hellbender.exceptions.UserException: '--' is not a valid command.

This is the same error you'd get if you ran ./gatk-launch -- instead of an actual tool name. I get this error for any tool name and options I specify.

I can see that the command does get sent to the cluster, with -- FlagStatSpark as the first part of the arguments.

Why is this happening? Is there something wrong with the GCS dotkit? Has something changed in GATK?


Issue filed on Github by Sheila.
Best Answer


  • dmccabe (Boston), Member, Broadie

    So, the problem is actually Broad's .google-cloud-sdk dotkit. I installed the latest version (171.0.0) on the VM and everything works. It would be good to specify a minimum version number in the GATK readme, and maybe add a check in the launcher, too.
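A minimal sketch of the kind of launcher check suggested above, assuming a hypothetical 140.0.0 floor (the actual minimum working version isn't established in this thread; all we know is that 98.0.0 fails and 171.0.0 works):

```shell
# Hypothetical version guard a launcher could run before submitting.
# The 140.0.0 floor is an assumption; this thread only shows that
# 98.0.0 fails and 171.0.0 works.
MIN_GCLOUD_VERSION="140.0.0"

# True when $1 >= $2, comparing dotted version strings with sort -V.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# The installed version could be read with something like:
#   installed=$(gcloud version 2>/dev/null | awk '/Google Cloud SDK/ {print $4}')
version_ge "98.0.0"  "$MIN_GCLOUD_VERSION" || echo "98.0.0 is too old"
version_ge "171.0.0" "$MIN_GCLOUD_VERSION" && echo "171.0.0 is new enough"
```

The `sort -V` comparison handles multi-digit components correctly (98 < 140 even though "9" > "1" lexically), which a plain string compare would get wrong.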
