What is the structure of a GATK command?
This document describes how GATK commands are structured and how to add arguments to basic command examples.
Basic Java syntax
Commands for GATK always follow the same basic syntax:
java [Java arguments] -jar GenomeAnalysisTK.jar [GATK arguments]
The core of the command is java -jar GenomeAnalysisTK.jar, which starts up the GATK program in a Java Virtual Machine (JVM). Any additional Java-specific arguments (such as -Xmx to increase memory allocation) should be inserted between java and -jar, like this:
java -Xmx4G -jar GenomeAnalysisTK.jar [GATK arguments]
The order of arguments between java and -jar is not important.
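As a rough illustration of where each group of arguments goes, the command can be assembled from separate pieces and printed before it is run. This is only a dry-run sketch: the variable names (JAVA_ARGS, GATK_JAR, GATK_ARGS) are hypothetical, -Djava.io.tmpdir=/tmp is just an assumed second JVM option added to show that several Java arguments can sit side by side, and the GATK arguments are left as the same placeholder used in the syntax line above.

```shell
# Dry run: build the command as a string and print it; GATK itself is never started.
# JAVA_ARGS, GATK_JAR and GATK_ARGS are hypothetical variable names.
JAVA_ARGS="-Xmx4G -Djava.io.tmpdir=/tmp"   # JVM options; their relative order does not matter
GATK_JAR="GenomeAnalysisTK.jar"
GATK_ARGS="[GATK arguments]"               # placeholder, to be replaced with real GATK arguments

# Java arguments go between "java" and "-jar"; GATK arguments follow the jar name.
CMD="java $JAVA_ARGS -jar $GATK_JAR $GATK_ARGS"
echo "$CMD"
```

Keeping the Java arguments in their own variable makes it harder to place them after the jar name by accident, where they would be passed to GATK rather than to the JVM.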
There are two universal arguments that are required for every GATK command (with very few exceptions): -R for Reference (e.g. -R human_b37.fasta) and -T for Tool name (e.g. -T HaplotypeCaller).
Additional arguments fall in two categories:
Engine arguments like -L (for specifying a list of intervals), which can be given to all tools and are technically optional but may be effectively required at certain steps for specific analytical designs (e.g. the -L argument for calling variants on exomes);
Tool-specific arguments, which may be required, like -I (to provide an input file containing sequence reads to tools that process BAM files), or optional, like -alleles (to provide a list of known alleles for genotyping).
The ordering of GATK arguments is not important, but we recommend always passing the tool name (-T) and reference (-R) first for consistency. It is also a good idea to consistently order arguments by some kind of logic in order to make it easy to compare different commands over the course of a project. It’s up to you to choose what that logic should be.
All available engine and tool-specific arguments are listed in the tool documentation section. Arguments typically have both a long name (prefixed by --) and a short name (prefixed by -). The GATK command line parser recognizes both equally, so you can use whichever you like, depending on whether you want your commands to be more verbose or more succinct.
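To make the long/short equivalence concrete, the sketch below rewrites a short-name argument list into its long-name form and prints the result without running GATK. The mapping shown (-T to --analysis_type, -R to --reference_sequence, -I to --input_file, -L to --intervals, -o to --out) reflects the GATK 3 names as best we can tell and should be treated as an assumption to verify against the tool documentation for your version; the helper functions to_long_name and expand_args are hypothetical and not part of GATK.

```shell
# Map a short argument name to its long form.
# ASSUMPTION: these pairs match GATK 3; check the tool documentation for your version.
to_long_name() {
    case "$1" in
        -T) echo "--analysis_type" ;;
        -R) echo "--reference_sequence" ;;
        -I) echo "--input_file" ;;
        -L) echo "--intervals" ;;
        -o) echo "--out" ;;
        *)  echo "$1" ;;   # values and unrecognized names pass through unchanged
    esac
}

# Rewrite every argument in turn and join the results with single spaces.
expand_args() {
    out=""
    for arg in "$@"; do
        out="$out $(to_long_name "$arg")"
    done
    echo "${out# }"        # drop the leading space
}

LONG_ARGS=$(expand_args -T HaplotypeCaller -R human_b37.fasta -I sample1.bam -o raw_variants.vcf)
echo "java -jar GenomeAnalysisTK.jar $LONG_ARGS"
```

Both forms describe the same run; the long form is easier to read in a script that will be revisited months later, while the short form is quicker to type interactively.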
Finally, a note about flags. Flags are arguments that have boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --keep_program_records will make certain GATK tools output additional information in the BAM header that would be omitted otherwise. In GATK, all flags are set to FALSE by default, so if you want to set one to TRUE, all you need to do is add the flag name to the command. You don't need to specify an actual value.
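Because a flag is switched on simply by being present, a script can add or omit it based on a setting. The following is a minimal dry-run sketch under stated assumptions: KEEP_PROGRAM_RECORDS is a hypothetical script variable, PrintReads is used only as an assumed example of a tool that writes a BAM file, and output.bam is a made-up file name. Check the tool documentation to confirm which tools are affected by the flag.

```shell
# Hypothetical script-level toggle; not a GATK setting.
KEEP_PROGRAM_RECORDS=true

# ASSUMPTION: PrintReads stands in for "a tool that writes a BAM file".
GATK_ARGS="-T PrintReads -R human_b37.fasta -I sample1.bam -o output.bam"

# A flag takes no value: its presence alone sets it to TRUE.
if [ "$KEEP_PROGRAM_RECORDS" = "true" ]; then
    GATK_ARGS="$GATK_ARGS --keep_program_records"
fi

# Dry run: print the command instead of executing it.
CMD="java -jar GenomeAnalysisTK.jar $GATK_ARGS"
echo "$CMD"
```

Setting the variable to anything other than true leaves the flag out, which corresponds to its default value of FALSE.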
Examples of complete GATK command lines
This is a very simple command that runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing raw variants.
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf
If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L exome_intervals.list
If we just want to genotype specific sites of interest using known alleles based on results from a previous study, we can change the HaplotypeCaller’s genotyping mode using -gt_mode, provide those alleles using -alleles, and restrict the analysis to just those sites using -L:
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES
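The three commands above share the same core and differ only in the arguments appended to it. One way to keep such variants easy to compare over the course of a project is to define the shared part once and extend it, as in this dry-run sketch. The variable names (BASE_CMD, EXOME_CMD, GGA_CMD) are hypothetical, and the commands are printed rather than executed.

```shell
# Shared core of all three example commands (printed only, never executed here).
BASE_CMD="java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf"

# Exome variant: add the engine argument -L with the target intervals.
EXOME_CMD="$BASE_CMD -L exome_intervals.list"

# Known-alleles variant: restrict to the known sites, supply the alleles, switch genotyping mode.
GGA_CMD="$BASE_CMD -L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES"

echo "$BASE_CMD"
echo "$EXOME_CMD"
echo "$GGA_CMD"
```

In a real script each variant would normally write to its own output file rather than reuse raw_variants.vcf; the file name is kept identical here only to mirror the examples above.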
For more examples of commands and for specific tool commands, see the tool documentation section.