Jar caching is a way to speed up running Spark tools on Google Dataproc. Normally, the GATK engine starts by uploading a copy of the GATK jar file to the cloud, because that jar is what actually runs the command you want to execute there. (GATK is basically cloning itself and telling the clone to do its homework.) However, it's a big jar: it can take a while to upload (depending on the speed of your connection), and it's inefficient to have to do that every single time you want to kick off a run. The good news is that you can bypass this by "caching" the jar, which means you store a copy of the jar in a Google Cloud Storage (GCS) bucket and tell GATK to use that instead of uploading a fresh copy each time.
To enable this, just set an environment variable named GATK_GCS_STAGING pointing to the location where you want the GATK jar to be cached. If you're using bash, you would add to your .bashrc (or equivalent): export GATK_GCS_STAGING=gs://your_bucket_name/some_path/. You may need to refresh your terminal session (or source the file) for the variable to become available.
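For example (the bucket name "my-bucket" and the directory "gatk4" below are placeholders; substitute your own):

```shell
# Append this line to ~/.bashrc (or equivalent).
# "my-bucket" and "gatk4" are placeholders for your own bucket and directory.
export GATK_GCS_STAGING=gs://my-bucket/gatk4/

# Confirm the variable is visible in the current session:
echo "$GATK_GCS_STAGING"
```

If the echo prints nothing, open a new terminal or run `source ~/.bashrc` so the setting takes effect.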
When you next launch a GATK command on Dataproc, the system checks whether the jar you are invoking matches a jar in that location. If not, it uploads a copy of your jar to that location, then proceeds with running your command. If it does find a matching jar, it skips the upload step and starts running your command right away. That means you never need to copy the jar manually, and from the second run against the same jar onward, you save yourself the upload time.
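The check-then-upload behavior described above can be sketched as follows. This is a purely illustrative Python simulation, not GATK's actual code: the staging bucket is stood in by a dict keyed by an object name derived from the jar's content hash, which is how a "matching jar" can be recognized without re-uploading.

```python
import hashlib

# Simulated GCS staging location: maps object name -> jar bytes.
# Illustrative only; the real GATK engine talks to GCS, not a dict.
staging_bucket = {}
upload_count = 0

def stage_jar(jar_bytes):
    """Upload the jar to the staging location only if no matching copy is already there."""
    global upload_count
    # Identify the jar by a hash of its contents, so each release maps to its own object.
    name = "gatk-" + hashlib.sha256(jar_bytes).hexdigest()[:12] + ".jar"
    if name not in staging_bucket:
        # Cache miss: upload a copy (the slow step we want to avoid repeating).
        staging_bucket[name] = jar_bytes
        upload_count += 1
    # Cache hit or fresh upload either way: the command runs against this staged copy.
    return name

jar = b"pretend this is the GATK jar"
first = stage_jar(jar)   # first run: uploads
second = stage_jar(jar)  # second run: matching jar found, upload skipped
```

Note that a different jar (say, a new GATK release) hashes to a different object name, so it would be uploaded alongside the old one rather than replacing it.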
Be sure to include the forward slash at the end of the bucket path, and use a dedicated directory, e.g. gatk4. Because each GATK4 jar that we release has a different identifying hash, the jars for different versions will accumulate as you upgrade to each new release. You can keep them around if you like (though you will incur storage costs which, while minimal, are non-zero), or you can delete them once you're satisfied you won't be running the older versions again.
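Cleanup can be done with the standard gsutil commands; a sketch (bucket, directory, and jar name below are placeholders for whatever `gsutil ls` actually shows in your staging directory):

```shell
# List the cached jars that have accumulated in your staging directory
# ("my-bucket" and "gatk4" are placeholders for your own bucket and path).
gsutil ls gs://my-bucket/gatk4/

# Delete a cached jar for a version you no longer run
# (substitute a real object name from the ls output above).
gsutil rm gs://my-bucket/gatk4/OLD_JAR_NAME.jar
```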