Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Workshop walkthrough (Brussels 2014)

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,856Administrator, GATK Developer admin

Note: this is a walkthrough of a hands-on GATK tutorial given at the Royal Institute of Natural Sciences on June 26, 2014 in Brussels, Belgium. It is intended to be performed with version 3.1-2 of the GATK and the corresponding data bundle.

Data files

We start with a BAM file called "NA12878.wgs.1lib.bam" (along with its index, "NA12878.wgs.1lib.bai") containing Illumina sequence reads from our favorite test subject, NA12878, that have been mapped using BWA-mem and processed using Picard tools according to the instructions here:

http://www.broadinstitute.org/gatk/guide/article?id=2799

Note that this file only contains sequence for a small region of chromosome 20, in order to minimize the file size and speed up the processing steps, for demonstration purposes. Normally you would run the steps in this tutorial on the entire genome (or exome).

This subsetted file was prepared by extracting read group 20GAV.1 from the CEUTrio.HiSeq.WGS.b37.NA12878.bam that is available in our resource bundle, using the following command:

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I CEUTrio.HiSeq.WGS.b37.NA12878.bam -o NA12878.wgs.1lib.bam -L 20 -rf SingleReadGroup -goodRG 20GAV.1

(We'll explain later in the tutorial how to use this kind of utility function to manipulate BAM files.)

We also have our human genome reference, called "human_g1k_v37.fasta", which has been prepared according to the instructions here:

http://www.broadinstitute.org/gatk/guide/article?id=2798

We will walk through both of these tutorials to explain the processing, but without actually running the steps to save time.

And finally we have a few resource files containing known variants (dbsnp, mills indels). These files are all available in the resource bundle on our FTP server. See here for access instructions:

http://www.broadinstitute.org/gatk/guide/article?id=1215


DAY 1

Prelude: BAM manipulation with Picard and Samtools

- Viewing a BAM file information

See also the Samtools docs:

http://samtools.sourceforge.net/samtools.shtml

- Reverting a BAM file

Clean the BAM we are using of previous GATK processing using this Picard command:

java -jar RevertSam.jar I=NA12878.wgs.1lib.bam O=aligned_reads_20.bam RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=false SORT_ORDER=coordinate

Note that it is possible to revert the file to FastQ format by setting REMOVE_ALIGNMENT_INFORMATION=true, but this method leads to biases in the alignment process, so if you want to do that, the better method is to follow the instructions given here:

http://www.broadinstitute.org/gatk/guide/article?id=2908

See also the Picard docs:

http://picard.sourceforge.net/command-line-overview.shtml

Mark Duplicates

See penultimate step of http://www.broadinstitute.org/gatk/guide/article?id=2799

After a few minutes, the file (which we'll call "dedupped_20.bam") is ready for use with GATK.

Interlude: tour of the documentation, website, forum etc. Also show how to access the bundle on the FTP server with FileZilla.

Getting to know GATK

Before starting to run the GATK Best Practices, we are going to learn about the basic syntax of GATK, how the results are output, how to interpret error messages, and so on.

- Run a simple walker: CountReads

Identify basic syntax, console output: version, command recap line, progress estimates, result if applicable.

java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20

- Add a filter to count how many duplicates were marked

Look at the filtering summary.

java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20 -rf DuplicateRead

- Demonstrate how to select a subset of read data

This can come in handy for bug reports.

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20:10000000-11000000 -o snippet.bam

Also show how a bug report should be formatted and submitted. See http://www.broadinstitute.org/gatk/guide/article?id=1894

- Demonstrate the equivalent for variant calls

Refer to docs for many other capabilities including selecting by sample name, up to complex queries.

java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta -V dbsnp_b37_20.vcf -o snippet.vcf -L 20:10000000-11000000

See http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html


GATK Best Practices for data processing (DNA seq)

These steps should typically be performed per lane of data. Here we are running the tools on a small slice of the data, to save time and disk space, but normally you would run on the entire genome or exome. This is especially important for BQSR, which does not work well on small amounts of data.

Now let's pick up where we left off after Marking Duplicates.

- Realign around Indels

See http://gatkforums.broadinstitute.org/discussion/2800/howto-perform-local-realignment-around-indels

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37 -o target_intervals.list -L 20:10000000-11000000 

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals target_intervals.list -o realigned.bam -L 20:10000000-11000000 

- Base recalibration

See http://gatkforums.broadinstitute.org/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I realigned_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I realigned_20.bam -BQSR recal_20.table -o recal_20.bam -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I recalibrated_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o post_recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_20.table -after post_recal_20.table -plots recalibration_plots.pdf -L 20:10000000-11000000

GATK Best Practices for variant calling (DNA seq)

- Run HaplotypeCaller in regular mode

See http://www.broadinstitute.org/gatk/guide/article?id=2803

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.vcf -L 20:10000000-11000000

Look at VCF in text and in IGV, compare with bam file.

- Run HaplotypeCaller in GVCF mode (banded and BP_RESOLUTION)

See http://www.broadinstitute.org/gatk/guide/article?id=3893

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.g.vcf -L 20:10000000-11000000 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Compare to regular VCF.

Geraldine Van der Auwera, PhD

Tagged:

Comments

Sign In or Register to comment.