Jar caching is a way to speed up running Spark tools on Google Dataproc. Normally, the GATK engine starts by uploading a copy of the GATK jar file to the cloud, because that jar is what actually runs the command you want to execute there. (GATK is basically cloning itself and telling the clone to do its homework.) However, it's a big jar: it can take a while to upload (depending on the speed of your connection), and it's inefficient to have to do that every single time you want to kick off a run. The good news is that you can bypass the upload by "caching" the jar: you store a copy of it in a Google Cloud Storage (GCS) bucket and tell GATK to use that copy instead of uploading a fresh one.
To enable this, set an environment variable named GATK_GCS_STAGING pointing to the location where you want the GATK jar to be cached. If you're using bash, add the following to your ~/.bashrc (or equivalent):
export GATK_GCS_STAGING=gs://your_bucket_name/some_path/
You may need to refresh your terminal session (e.g. by running source ~/.bashrc) for the variable to become available.
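As a concrete sketch (the bucket name and the gatk4 directory below are placeholders, and this assumes a bash shell):

```shell
# Point GATK at the cache location; note the trailing slash
# and the dedicated directory (bucket name is a placeholder).
export GATK_GCS_STAGING=gs://your_bucket_name/gatk4/
# Confirm the variable is visible in the current session.
echo "$GATK_GCS_STAGING"
```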
When you launch your next GATK command on Dataproc, the system checks whether the jar you are invoking matches a jar in that location. If not, it uploads a copy of your jar to that location and then proceeds with running your command. If it does find a matching jar, it skips the upload step and starts running your command right away. That means you never need to copy the jar manually, and from the second run with the same jar onward, you save yourself the upload time.
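For example, a Spark tool launched against a Dataproc cluster might look like the following (the input/output paths and cluster name are placeholders; the lone `--` separates GATK tool arguments from the Spark runner options). With GATK_GCS_STAGING set, the first such run populates the cache and later runs reuse it:

```shell
gatk PrintReadsSpark \
    -I gs://your_bucket_name/input.bam \
    -O gs://your_bucket_name/output.bam \
    -- \
    --spark-runner GCS \
    --cluster your-dataproc-cluster
```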
Be sure to include the forward slash at the end of the bucket path and use a dedicated directory, e.g. gatk4. Because each GATK4 jar that we release has a different identifying hash, the jars for different versions will accumulate as you upgrade to each latest release. You can keep them around if you like (though you will incur storage costs which, while minimal, will be non-zero), or you can delete them once you're satisfied you won't be running the older versions again.
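If you do decide to clean up, something along these lines should work (the bucket path is a placeholder, and the jar file name shown is illustrative; the exact names depend on which releases you've run):

```shell
# See which cached jars have accumulated in the cache directory.
gsutil ls gs://your_bucket_name/gatk4/
# Remove a specific jar you no longer need (name is illustrative).
gsutil rm gs://your_bucket_name/gatk4/gatk-package-4.0.0.0-spark.jar
```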