The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# (howto) Run the GATK for the first time

edited August 2013

#### NOTICE:

This tutorial is slightly out of date so the output is a little different. We'll update this soon, but in the meantime, don't freak out if you get a result that reads something like

INFO 18:32:38,826 CountReads - CountReads counted 33 reads in the traversal


INFO  16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33


You're doing the right thing and getting the right result.

#### Objective

Run a basic analysis command on example data.

#### Steps

1. Invoke the GATK CountReads command
2. Further exercises

### 1. Invoke the GATK CountReads command

A very simple analysis that you can do with the GATK is getting a count of the reads in a BAM file. The GATK is capable of much more powerful analyses, but this is a good starting example because there are very few things that can go wrong.

So we are going to count the reads in the file exampleBAM.bam, which you can find in the GATK resource bundle along with its associated index (same file name with .bai extension), as well as the example reference exampleFASTA.fasta and its associated index (same file name with .fai extension) and dictionary (same file name with .dict extension). Copy them to your working directory so that your directory contents look like this:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai


#### Action

Type the following command:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam


where -T CountReads specifies which analysis tool we want to use, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.

For any analysis that you want to run on a set of aligned reads, you will always need to use at least these three arguments:

• -T for the tool name, which specifices the corresponding analysis

• -R for the reference sequence file

• -I for the input BAM file of aligned reads

They don't have to be in that order in your command, but this way you can remember that you need them if you TRI...

#### Expected Result

After a few seconds you should see output that looks like to this:

INFO  16:17:45,945 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:17:45,946 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO  16:17:45,947 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  16:17:45,947 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
INFO  16:17:45,947 HelpFormatter - Date/Time: 2012/07/25 16:17:45
INFO  16:17:45,947 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:17:45,948 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:17:45,950 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:17:45,982 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 16:17:45,993 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  16:17:46,060 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO  16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33
INFO  16:17:46,061 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours
INFO  16:17:46,100 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO  16:17:46,729 GATKRunReport - Uploaded run statistics report to AWS S3


Depending on the GATK release, you may see slightly different information output, but you know everything is running correctly if you see the line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33


If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK is properly installed.

If you do see this output, congratulations! You just successfully ran you first GATK analysis!

Basically the output you see means that the CountReadsWalker (which you invoked with the command line option -T CountReads) counted 33 reads in the exampleBAM.bam file, which is exactly what we expect to see.

Wait, what is this walker thing?

In the GATK jargon, we call the tools walkers because the way they work is that they walk through the dataset --either along the reference sequence (LocusWalkers), or down the list of reads in the BAM file (ReadWalkers)-- collecting the requested information along the way.

### 2. Further Exercises

Now that you're rocking the read counts, you can start to expand your use of the GATK command line.

Let's say you don't care about counting reads anymore; now you want to know the number of loci (positions on the genome) that are covered by one or more reads. The name of the tool, or walker, that does this is CountLoci. Since the structure of the GATK command is basically always the same, you can simply switch the tool name, right?

#### Action

Instead of this command, which we used earlier:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam


this time you type this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam


See the difference?

#### Result

You should see something like this output:

INFO  16:18:26,183 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:18:26,185 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO  16:18:26,185 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  16:18:26,186 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam
INFO  16:18:26,186 HelpFormatter - Date/Time: 2012/07/25 16:18:26
INFO  16:18:26,186 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:18:26,186 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:18:26,189 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:18:26,222 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 16:18:26,233 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  16:18:26,351 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO  16:18:26,351 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
2052
INFO  16:18:26,411 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO  16:18:26,450 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO  16:18:27,124 GATKRunReport - Uploaded run statistics report to AWS S3


Great! But wait -- where's the result? Last time the result was given on this line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33


But this time there is no line that says [REDUCE RESULT]! Is something wrong?

Not really. The program ran just fine -- but we forgot to give it an output file name. You see, the CountLoci walker is set up to output the result of its calculations to a text file, unlike CountReads, which is perfectly happy to output its result to the terminal screen.

#### Action

So we repeat the command, but this time we specify an output file, like this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt


where -o (lowercase o, not zero) is used to specify the output.

#### Result

You should get essentially the same output on the terminal screen as previously (but notice the difference in the line that contains Program Args -- the new argument is included):

INFO  16:29:15,451 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:29:15,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO  16:29:15,453 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  16:29:15,453 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt
INFO  16:29:15,454 HelpFormatter - Date/Time: 2012/07/25 16:29:15
INFO  16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:29:15,457 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:29:15,488 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 16:29:15,499 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO  16:29:15,618 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO  16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO  16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3


This time however, if we look inside the working directory, there is a newly created file there called output.txt.

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai
-rw-r--r--  1 vdauwera  CHARLES\Domain Users       5 Jul 25 16:29 output.txt


This file contains the result of the analysis:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt
2052


This means that there are 2052 loci in the reference sequence that are covered by at least one or more reads in the BAM file.

#### Discussion

Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on.

Beyond the common basic arguments that almost all GATK walkers require, most of them also have specific requirements or options that are important to how they work. You should always check what are the specific arguments that are required, recommended and/or optional for the walker you want to use before starting an analysis.

Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta


the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR
##### ERROR MESSAGE: Walker requires reads but none were provided.
##### ERROR ------------------------------------------------------------------------------------------


You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command.

So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required by the GATK to run -- it's just highly desirable if you actually want the result of the analysis!

There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful.

So, at the risk of getting repetitive, always read the documentation of each walker that you want to use!

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• chnPosts: 1Member

thank you for details

• rutgersPosts: 5Member

Hi there, I know that this post was written some time ago and might not been maintained but let's try anyway..
I'm starting to work with GATK and I downloaded the example datafiles at: ftp://ftp.broadinstitute.org/bundle/2.8/ (and unzipped) and followed the instructions but I got an error message that makes me think that there's something going on with the input files (see message at the end).
Are the input files correct on the bundle or did I do something wrong? I read the error page for this problem but couldn't find a clear answer.
thanks a lot
G

~/Downloads/GATKexamples$java -jar$GATK -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
INFO 17:24:58,091 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:24:58,094 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 17:24:58,095 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:24:58,101 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
INFO 17:24:58,106 HelpFormatter - Executing as jimp@jimp on Linux 3.13.0-77-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_72-b15.
INFO 17:24:58,107 HelpFormatter - Date/Time: 2016/02/22 17:24:58
INFO 17:24:58,107 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:24:58,107 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:24:58,845 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:24:58,964 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 17:24:58,974 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 17:24:58,999 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 17:25:06,160 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ----

Ah, you're not doing anything wrong; this is an overreaction from a new validation check we added in the last version, which has this unintended side effect in some cases. We'll fix it in the next version; in the meantime you can override the check by using -U ALLOW_SEQ_DICT_INCOMPATIBILITY.

Geraldine Van der Auwera, PhD

• FinlandPosts: 2Member

Hey, I have a question about the error: ERROR MESSAGE: Walker requires reads but none were provided. So I ran HaplotypeCaller in gvcf mode for several bams (all lane bams for one sample and the read groups are correct). I used this https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php gvcf example except I had more than one -I sample.bam input. The program seems to run smootly and created good sized files for all of my WGS samples but there is still that error message for at least one of the runs but there is no output file with an alarming size or anything. What can cause 'ERROR MESSAGE: Walker requires reads but none were provided' because all the runs have plenty of reads to run? Should I fix something?

So my exact code is:
java -jar $USERAPPL/GATK_3.6/GenomeAnalysisTK.jar -T HaplotypeCaller -R$ANNOT/Homo_sapiens.GRCh38.dna.primary_assembly.fa $input --emitRefConfidence GVCF -o$outpath/$name.raw.snps.indels.g.vcf where$input is like -I sample1.bam -I sample2.bam -I sampleN.bam
and name comes automatically from the input sample name

@Erru
Hi,

You may be missing the --input_argument.

It looks like you specified the BAM file right after the reference.

-Sheila

• FinlandPosts: 2Member

The $input contains the -I too so the code the program actually sees is like java -jar$USERAPPL/GATK_3.6/GenomeAnalysisTK.jar -T HaplotypeCaller -R $ANNOT/Homo_sapiens.GRCh38.dna.primary_assembly.fa -I sample1.bam -I sample2.bam -I sampleN.bam --emitRefConfidence GVCF -o$outpath/$name.raw.snps.indels.g.vcf or did you mean something else. I made the program print the$input out for me to see so I know it actually always works. But now I figured out that I actually input one null value to the code (for automating the code) so this is what should happen so my mistake...