GATK4 command-line syntax
- Java command basics
- Using the gatk wrapper script (recommended)
- Adding GATK arguments
- Adding Java arguments
- Adding Spark arguments
- Examples of real commands
1. Java command basics
GATK follows the basic Java command-line syntax:
java -jar program.jar [program arguments]
The core of the command is java -jar program.jar, which starts up the program in a Java Virtual Machine (JVM).
2. Using the gatk wrapper script (recommended)
We provide a launch script, gatk, that wraps the java -jar program.jar part of the command in a single invocation. There are several reasons for this that we don't go into in this article (including that the package you download now contains two jars), but the upshot is that you can add GATK to your PATH variable, and that we can build in some autocomplete functionality for convenience.
So the basic command is now:
gatk [program arguments]
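As a quick sanity check, the sketch below tests whether the gatk wrapper is reachable through PATH before you rely on it. The install path in the hint message is a placeholder, not a real location, and the gatk_status variable is introduced only for this illustration.

```shell
# Report whether the gatk wrapper can be found through PATH.
if command -v gatk >/dev/null 2>&1; then
    gatk_status="found"
    # List the tools bundled with this install; the listing is informational,
    # so a non-zero exit status is ignored.
    gatk --list || true
else
    gatk_status="missing"
    echo "gatk not found; add the unpacked GATK directory to PATH, for example:"
    # /path/to/gatk is a placeholder for wherever you unpacked the package.
    echo '  export PATH="$PATH:/path/to/gatk"'
fi
echo "gatk wrapper: $gatk_status"
```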
3. Adding GATK arguments
The only universally required argument is the name of the GATK tool you want to run. It is a positional argument, so you specify it directly after the gatk bit, like this:
gatk ToolName [tool arguments]
After the tool name, you can specify any arguments in any order, with the appropriate argument name as follows:
gatk ToolName --argument-name value
Argument naming conventions
The overwhelming majority of argument names follow a "kebab" convention, where the name is prefixed by two dashes (--) and, where applicable, words are separated by single dashes (-). A minority of very commonly used arguments also accept a short name prefixed by a single dash (-). The short name is often a single capital letter.
The ordering of GATK arguments is not important, but we recommend passing required arguments first for consistency. It is also a good idea to consistently order arguments by some kind of logic in order to make it easy to compare different commands over the course of a project. It’s up to you to choose what that logic should be.
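To make the naming convention concrete, the sketch below writes the same HaplotypeCaller command twice, once with short names and once with the corresponding long names (--reference, --input, --output). The commands are only printed, not run, and the file names are placeholders.

```shell
# Short names: a single dash followed by a capital letter.
short_form="gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.vcf"

# Long names: two leading dashes, words in kebab case.
long_form="gatk HaplotypeCaller --reference reference.fasta --input sample1.bam --output variants.vcf"

# Print both forms so they can be compared side by side.
printf '%s\n' "$short_form" "$long_form"
```

Both forms should be interchangeable; picking one style and sticking to it makes commands easier to compare later.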
Flags are arguments that take boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --QUIET will suppress some log output. To activate a flag that is set to FALSE by default, all you need to do is add the flag name to the command (no need to specify an actual value). To deactivate a flag that is set to TRUE by default, you need to specify the value as FALSE; for example, --create-output-variant-index FALSE will disable automatic variant indexing.
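The two flag forms can be sketched as a small script that assembles an argument list. The quiet and index_output toggles are hypothetical shell variables introduced for this example, not GATK settings, and the command is printed rather than executed.

```shell
# Hypothetical toggles for this sketch; they are not GATK settings.
quiet=true
index_output=false

# Start from the required arguments (placeholder file names).
set -- HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.vcf

# A flag that defaults to FALSE is switched on by its name alone.
if [ "$quiet" = true ]; then
    set -- "$@" --QUIET
fi

# A flag that defaults to TRUE needs an explicit FALSE to switch it off.
if [ "$index_output" = false ]; then
    set -- "$@" --create-output-variant-index FALSE
fi

# Print the assembled command for review instead of running it.
gatk_cmd="gatk $*"
echo "$gatk_cmd"
```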
4. Adding Java arguments
Normally you would insert any Java-specific arguments (such as -Xmx to specify memory allocation) between the java and -jar bits of the basic Java command, like this:
java -Xmx4G -jar program.jar [program arguments]
When you're using the gatk wrapper syntax (which we strongly recommend), you have to do it a bit differently, like this:
gatk --java-options "-Xmx4G" [program arguments]
To specify multiple Java arguments, just add them to the quoted string like this:
gatk --java-options "-Xmx4G -XX:+PrintGCDetails" [program arguments]
The order of Java arguments inside the quoted string is not important.
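The quoting matters because the whole Java options string has to reach gatk as a single word. The sketch below builds the string from a hypothetical heap_gb variable and then counts the words that would be passed, without running gatk.

```shell
# Hypothetical heap size in gigabytes; adjust to your machine.
heap_gb=4
java_opts="-Xmx${heap_gb}G -XX:+PrintGCDetails"

# Quoting "$java_opts" keeps both Java arguments together as one word.
set -- --java-options "$java_opts" HaplotypeCaller \
    -R reference.fasta -I sample1.bam -O variants.vcf

arg_count=$#
java_arg=$2
echo "words passed to gatk: $arg_count"
echo "Java options word:    $java_arg"
```

If the quotes around the Java options were dropped, the shell would split the string and -XX:+PrintGCDetails would be handed over as a separate word rather than as part of the --java-options value.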
5. Adding Spark arguments
When you run Spark-capable tools, you may need to specify Spark-specific parameters. These must be appended to the end of your GATK command, after a -- separator, like this:
gatk [GATK arguments] -- [Spark arguments]
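To illustrate how the -- separator divides a command, the sketch below splits an example argument list into its GATK part and its Spark part. This is an illustration only and not how the gatk wrapper itself parses arguments; the tool (PrintReadsSpark), the file names, and the Spark options shown are example choices, so check the tool documentation for what applies to your setup.

```shell
# Example argument list; 'local[2]' is quoted so the shell does not glob it.
set -- PrintReadsSpark -I input.bam -O output.bam \
    -- \
    --spark-runner LOCAL --spark-master 'local[2]'

side="gatk"
gatk_args=""
spark_args=""
for arg in "$@"; do
    # The first bare "--" switches from GATK arguments to Spark arguments.
    if [ "$side" = "gatk" ] && [ "$arg" = "--" ]; then
        side="spark"
        continue
    fi
    if [ "$side" = "gatk" ]; then
        gatk_args="${gatk_args:+$gatk_args }$arg"
    else
        spark_args="${spark_args:+$spark_args }$arg"
    fi
done

echo "GATK arguments:  $gatk_args"
echo "Spark arguments: $spark_args"
```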
6. Examples of real commands
This is a very simple command that runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing variant calls.
gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.vcf
Now let's switch to running HaplotypeCaller in GVCF mode so that we can add multiple samples to our analysis in a scalable way:
gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O variants.g.vcf -ERC GVCF
We can write this same command on multiple lines to make it more readable by using backslashes at the ends of lines:
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF
We can add the common Java memory argument -Xmx like this:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF
If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list
Now let's say we want to add a read filter that deals with some problems in our data:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter
If we want to reduce the amount of chatter in the logs, we can turn on the --QUIET setting like this:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter \
    --QUIET
And finally, if we want to turn off automatic variant index creation:
gatk --java-options "-Xmx4G" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O variants.g.vcf \
    -ERC GVCF \
    -L exome_intervals.list \
    --read-filter OverclippedReadFilter \
    --QUIET \
    --create-output-variant-index FALSE
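Since GVCF mode is meant for adding multiple samples, one possible way to apply the same command to several inputs is a small shell loop. In the sketch below, the gvcf_command helper, the sample names, and the one-output-per-sample naming scheme are all assumptions made for illustration; the function only prints each command so it can be reviewed before being run.

```shell
# Hypothetical helper: print the GVCF-mode command for one sample.
# Assumes the input is <sample>.bam and the output is <sample>.g.vcf.
gvcf_command() {
    sample="$1"
    printf 'gatk --java-options "-Xmx4G" HaplotypeCaller -R reference.fasta -I %s.bam -O %s.g.vcf -ERC GVCF -L exome_intervals.list\n' \
        "$sample" "$sample"
}

# Hypothetical sample names; replace with your own.
for sample in sample1 sample2; do
    gvcf_command "$sample"
done
```

Once the printed commands look right, they could be run directly or piped to a shell.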
For more examples of commands and for specific tool command recommendations, see the tool documentation section.