We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Jar caching and enabling it

shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
edited August 2017 in Dictionary

When running GATK4 Spark jobs, we see in the standard output a message about caching the jar file.

Using GATK jar /Applications/genomicstools/gatk/gatk-4.latest/gatk-package-4.beta.2-spark.jar
jar caching is disabled because GATK_GCS_STAGING is not set

please set GATK_GCS_STAGING to a bucket you have write access too in order to enable jar caching
add the following line to you .bashrc or equivalent startup script

    export GATK_GCS_STAGING=gs://<my_bucket>/

Instead of uploading the local jar each time you run a command, it is possible to cache the jar in a run to a cloud bucket. To enable this, you will have to export, i.e. add to your bashrc, GATK_GCS_STAGING=gs://your_bucket_name/some_folder_name/. Gatk-launch will check if the jar you are invoking matches the jar in this folder and, if they match, will copy this jar from google storage to the google cluster instead of your local system. Here is an example export command.

export GATK_GCS_STAGING=gs://spacecade7/gatk4/

Uploads to Google cloud buckets are free though. So this feature is advantageous for situations in which you have limited or slow network connections. And we know uploads are generally much slower than downloads.

Remember to include the forward slash at the end of the bucket path and use a dedicated directory, e.g. gatk4. Because each GATK4 release’s jar will have a different identifying hash, as you upgrade to each latest release, different versioned jars will start to accumulate and otherwise litter your bucket.

Sign In or Register to comment.