Workshop walkthrough (Brussels 2014)

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,391Administrator, GATK Developer admin

Note: this is a walkthrough of a hands-on GATK tutorial given at the Royal Institute of Natural Sciences on June 26, 2014 in Brussels, Belgium. It is intended to be performed with version 3.1-2 of the GATK and the corresponding data bundle.

Data files

We start with a BAM file called "NA12878.wgs.1lib.bam" (along with its index, "NA12878.wgs.1lib.bai") containing Illumina sequence reads from our favorite test subject, NA12878, that have been mapped using BWA-mem and processed using Picard tools according to the instructions here:

http://www.broadinstitute.org/gatk/guide/article?id=2799

Note that this file only contains sequence for a small region of chromosome 20, in order to minimize the file size and speed up the processing steps, for demonstration purposes. Normally you would run the steps in this tutorial on the entire genome (or exome).

This subsetted file was prepared by extracting read group 20GAV.1 from the CEUTrio.HiSeq.WGS.b37.NA12878.bam that is available in our resource bundle, using the following command:

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I CEUTrio.HiSeq.WGS.b37.NA12878.bam -o NA12878.wgs.1lib.bam -L 20 -rf SingleReadGroup -goodRG 20GAV.1

(We'll explain later in the tutorial how to use this kind of utility function to manipulate BAM files.)

We also have our human genome reference, called "human_g1k_v37.fasta", which has been prepared according to the instructions here:

http://www.broadinstitute.org/gatk/guide/article?id=2798

We will walk through both of these tutorials to explain the processing, but without actually running the steps to save time.

And finally we have a few resource files containing known variants (dbsnp, mills indels). These files are all available in the resource bundle on our FTP server. See here for access instructions:

http://www.broadinstitute.org/gatk/guide/article?id=1215


DAY 1

Prelude: BAM manipulation with Picard and Samtools

- Viewing a BAM file information

See also the Samtools docs:

http://samtools.sourceforge.net/samtools.shtml

- Reverting a BAM file

Clean the BAM we are using of previous GATK processing using this Picard command:

java -jar RevertSam.jar I=NA12878.wgs.1lib.bam O=aligned_reads_20.bam RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=false SORT_ORDER=coordinate

Note that it is possible to revert the file to FastQ format by setting REMOVE_ALIGNMENT_INFORMATION=true, but this method leads to biases in the alignment process, so if you want to do that, the better method is to follow the instructions given here:

http://www.broadinstitute.org/gatk/guide/article?id=2908

See also the Picard docs:

http://picard.sourceforge.net/command-line-overview.shtml

Mark Duplicates

See penultimate step of http://www.broadinstitute.org/gatk/guide/article?id=2799

After a few minutes, the file (which we'll call "dedupped_20.bam") is ready for use with GATK.

Interlude: tour of the documentation, website, forum etc. Also show how to access the bundle on the FTP server with FileZilla.

Getting to know GATK

Before starting to run the GATK Best Practices, we are going to learn about the basic syntax of GATK, how the results are output, how to interpret error messages, and so on.

- Run a simple walker: CountReads

Identify basic syntax, console output: version, command recap line, progress estimates, result if applicable.

java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20

- Add a filter to count how many duplicates were marked

Look at the filtering summary.

java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20 -rf DuplicateRead

- Demonstrate how to select a subset of read data

This can come in handy for bug reports.

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20:10000000-11000000 -o snippet.bam

Also show how a bug report should be formatted and submitted. See
http://www.broadinstitute.org/gatk/guide/article?id=1894

- Demonstrate the equivalent for variant calls

Refer to docs for many other capabilities including selecting by sample name, up to complex queries.

java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta -V dbsnp_b37_20.vcf -o snippet.vcf -L 20:10000000-11000000

See http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html


GATK Best Practices for data processing (DNA seq)

These steps should typically be performed per lane of data. Here we are running the tools on a small slice of the data, to save time and disk space, but normally you would run on the entire genome or exome. This is especially important for BQSR, which does not work well on small amounts of data.

Now let's pick up where we left off after Marking Duplicates.

- Realign around Indels

See http://gatkforums.broadinstitute.org/discussion/2800/howto-perform-local-realignment-around-indels

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37 -o target_intervals.list -L 20:10000000-11000000 

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals target_intervals.list -o realigned.bam -L 20:10000000-11000000 

- Base recalibration

See http://gatkforums.broadinstitute.org/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I realigned_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I realigned_20.bam -BQSR recal_20.table -o recal_20.bam -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I recalibrated_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o post_recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_20.table -after post_recal_20.table -plots recalibration_plots.pdf -L 20:10000000-11000000

GATK Best Practices for variant calling (DNA seq)

- Run HaplotypeCaller in regular mode

See http://www.broadinstitute.org/gatk/guide/article?id=2803

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.vcf -L 20:10000000-11000000

Look at VCF in text and in IGV, compare with bam file.

- Run HaplotypeCaller in GVCF mode (banded and BP_RESOLUTION)

See http://www.broadinstitute.org/gatk/guide/article?id=3893

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.g.vcf -L 20:10000000-11000000 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Compare to regular VCF.

Geraldine Van der Auwera, PhD

Tagged:

Comments

  • mayaabmayaab IsraelPosts: 35Member ✭✭

    since I left right before running HaplotypeCaller in different modes, I would appreciate if can write here some highlights of that part.
    Thanks,
    Maya

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,391Administrator, GATK Developer admin

    Hi Maya,

    We mostly talked about the differences you can observe in how regular VCFs and GVCFs are structured, how to read them (in the text and in IGV), and what are the key argument differences. In the near future we'll try to develop some more detailed text for these tutorials, but to be honest it won't happen immediately as we have a few other priorities to take care of first.

    Geraldine Van der Auwera, PhD

  • mayaabmayaab IsraelPosts: 35Member ✭✭

    Thanks. the workshop was great, and it was nice to meet you all

  • astrandastrand New YorkPosts: 46Member

    Just FYI: there is a typo in the realign indels section when using RealignerTargetCreator: the file Mills_and_1000G_gold_standard.indels.b37 should have a .vcf extension added to its name.

  • astrandastrand New YorkPosts: 46Member

    Also, in the base recalibration step, shouldn't the input file when using PrintReads be realigned.bam as per the previous step (realign indels)? When using IndelRealigner the output file was named realigned.bam.

  • astrandastrand New YorkPosts: 46Member

    I'm sorry to be leaving several comments here but in the base recalibration step, I believe the input file should be recal_20.bam instead of recalibrated_20.bam.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,391Administrator, GATK Developer admin

    @astrand All your observations are correct - there are a few mistakes in filenames in that document, because it was produced by combining several tutorials that used different file naming conventions.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.