If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Workshop walkthrough (Brussels 2014)

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

Note: this is a walkthrough of a hands-on GATK tutorial given at the Royal Institute of Natural Sciences on June 26, 2014 in Brussels, Belgium. It is intended to be performed with version 3.1-2 of the GATK and the corresponding data bundle.

Data files

We start with a BAM file called "NA12878.wgs.1lib.bam" (along with its index, "NA12878.wgs.1lib.bai") containing Illumina sequence reads from our favorite test subject, NA12878, that have been mapped using BWA-mem and processed using Picard tools according to the instructions here:

Note that this file only contains sequence for a small region of chromosome 20, in order to minimize the file size and speed up the processing steps, for demonstration purposes. Normally you would run the steps in this tutorial on the entire genome (or exome).

This subsetted file was prepared by extracting read group 20GAV.1 from the CEUTrio.HiSeq.WGS.b37.NA12878.bam that is available in our resource bundle, using the following command:

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I CEUTrio.HiSeq.WGS.b37.NA12878.bam -o NA12878.wgs.1lib.bam -L 20 -rf SingleReadGroup -goodRG 20GAV.1

(We'll explain later in the tutorial how to use this kind of utility function to manipulate BAM files.)

We also have our human genome reference, called "human_g1k_v37.fasta", which has been prepared according to the instructions here:

We will walk through both of these tutorials to explain the processing, but without actually running the steps to save time.

And finally we have a few resource files containing known variants (dbsnp, mills indels). These files are all available in the resource bundle on our FTP server. See here for access instructions:


Prelude: BAM manipulation with Picard and Samtools

- Viewing a BAM file information

See also the Samtools docs:

- Reverting a BAM file

Clean the BAM we are using of previous GATK processing using this Picard command:

java -jar RevertSam.jar I=NA12878.wgs.1lib.bam O=aligned_reads_20.bam RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=false SORT_ORDER=coordinate

Note that it is possible to revert the file to FastQ format by setting REMOVE_ALIGNMENT_INFORMATION=true, but this method leads to biases in the alignment process, so if you want to do that, the better method is to follow the instructions given here:

See also the Picard docs:

Mark Duplicates

See penultimate step of

After a few minutes, the file (which we'll call "dedupped_20.bam") is ready for use with GATK.

Interlude: tour of the documentation, website, forum etc. Also show how to access the bundle on the FTP server with FileZilla.

Getting to know GATK

Before starting to run the GATK Best Practices, we are going to learn about the basic syntax of GATK, how the results are output, how to interpret error messages, and so on.

- Run a simple walker: CountReads

Identify basic syntax, console output: version, command recap line, progress estimates, result if applicable.

java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20

- Add a filter to count how many duplicates were marked

Look at the filtering summary.

java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20 -rf DuplicateRead

- Demonstrate how to select a subset of read data

This can come in handy for bug reports.

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20:10000000-11000000 -o snippet.bam

Also show how a bug report should be formatted and submitted. See

- Demonstrate the equivalent for variant calls

Refer to docs for many other capabilities including selecting by sample name, up to complex queries.

java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta -V dbsnp_b37_20.vcf -o snippet.vcf -L 20:10000000-11000000


GATK Best Practices for data processing (DNA seq)

These steps should typically be performed per lane of data. Here we are running the tools on a small slice of the data, to save time and disk space, but normally you would run on the entire genome or exome. This is especially important for BQSR, which does not work well on small amounts of data.

Now let's pick up where we left off after Marking Duplicates.

- Realign around Indels


java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37 -o target_intervals.list -L 20:10000000-11000000 

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals target_intervals.list -o realigned.bam -L 20:10000000-11000000 

- Base recalibration


java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I realigned_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I realigned_20.bam -BQSR recal_20.table -o recal_20.bam -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I recalibrated_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o post_recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_20.table -after post_recal_20.table -plots recalibration_plots.pdf -L 20:10000000-11000000

GATK Best Practices for variant calling (DNA seq)

- Run HaplotypeCaller in regular mode


java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.vcf -L 20:10000000-11000000

Look at VCF in text and in IGV, compare with bam file.

- Run HaplotypeCaller in GVCF mode (banded and BP_RESOLUTION)


java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.g.vcf -L 20:10000000-11000000 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Compare to regular VCF.



  • mayaabmayaab IsraelMember ✭✭

    since I left right before running HaplotypeCaller in different modes, I would appreciate if can write here some highlights of that part.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Maya,

    We mostly talked about the differences you can observe in how regular VCFs and GVCFs are structured, how to read them (in the text and in IGV), and what are the key argument differences. In the near future we'll try to develop some more detailed text for these tutorials, but to be honest it won't happen immediately as we have a few other priorities to take care of first.

  • mayaabmayaab IsraelMember ✭✭

    Thanks. the workshop was great, and it was nice to meet you all

  • astrandastrand New YorkMember

    Just FYI: there is a typo in the realign indels section when using RealignerTargetCreator: the file Mills_and_1000G_gold_standard.indels.b37 should have a .vcf extension added to its name.

  • astrandastrand New YorkMember

    Also, in the base recalibration step, shouldn't the input file when using PrintReads be realigned.bam as per the previous step (realign indels)? When using IndelRealigner the output file was named realigned.bam.

  • astrandastrand New YorkMember

    I'm sorry to be leaving several comments here but in the base recalibration step, I believe the input file should be recal_20.bam instead of recalibrated_20.bam.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @astrand All your observations are correct - there are a few mistakes in filenames in that document, because it was produced by combining several tutorials that used different file naming conventions.

This discussion has been closed.