The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

(howto) Prepare a reference for use with BWA and GATK

edited August 2016 in Archive

NOTE: This tutorial has been replaced by a more recent version that uses GRCh38 that you can find here.

Objective

Prepare a reference sequence so that it is suitable for use with BWA and GATK.

Prerequisites

• Installed BWA
• Installed SAMTools
• Installed Picard

Steps

1. Generate the BWA index
2. Generate the Fasta file index
3. Generate the sequence dictionary

1. Generate the BWA index

Action

Run the following BWA command:

bwa index -a bwtsw reference.fa


where -a bwtsw specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.

Expected Result

This creates a collection of files used by BWA to perform the alignment.

2. Generate the fasta file index

Action

Run the following SAMtools command:

samtools faidx reference.fa


Expected Result

This creates a file called reference.fa.fai, with one record per line for each of the contigs in the FASTA reference file. Each record is composed of the contig name, size, location, basesPerLine and bytesPerLine.

3. Generate the sequence dictionary

Action

Run the following Picard command:

java -jar picard.jar CreateSequenceDictionary \
REFERENCE=reference.fa \
OUTPUT=reference.dict


Note that this is the new syntax for use with the latest version of Picard. Older versions used a slightly different syntax because all the tools were in separate jars, so you'd call e.g. java -jar CreateSequenceDictionary.jar directly.

Expected Result

This creates a file called reference.dict formatted like a SAM header, describing the contents of your reference FASTA file.

Post edited by shlee on

Geraldine Van der Auwera, PhD

Tagged:

• Member, Dev Posts: 543 ✭✭✭✭

Hey Geraldine - I suspect that these howtos are still being written, but I wanted to point out a couple of issues with the BWA index step (#1 above): 1) The default mode in bwa index won't handle a mammalian-sized genome, you need to supply the -a bwtsw argument, and 2) I'm pretty sure that bwa index doesn't generate .fai files (though I haven't tested and could certainly be wrong)

• Member Posts: 255 ✭✭✭

@pdexheimer said, "I'm pretty sure that bwa index doesn't generate .fai files (though I haven't tested and could certainly be wrong)". This is true. Otherwise samtools faidx ref.fa would be redundant.

Well spotted guys, thank you! I'll blame bad editing.

Geraldine Van der Auwera, PhD

• Member Posts: 16
edited March 2014

Hi

Do we still follow this process if we use the GATK's resource bundle? The bundle already contains:

human_g1k_v37.fasta

human_g1k_v37.dict

human_g1k_v37.fasta.fai

For the files provided in the bundle, do we just need to create the bwa indices for GRCh37 using the following command:

bwa index human_g1k_v37.fasta

or are there any additional preparations for the reference genome before the alignment stage?

• Member Posts: 255 ✭✭✭

You can, just need to note the addition that @pdexheimer mentions above in regards to bwa index. On a more subtle note, indexes generated via bwa index -a bwtsw for pre/post bwa 0.6 versions are different (due to the coding that allows bwa indexing of genomes greater than 4Gb). In short, all you should have to do with any version of bwa greater than 0.5-9 is index the genome with bwa index -a bwtsw. You get the rest you need when you unzip the files that the GATK resource bundle provides.

• PakistanMember Posts: 6

Looks like a stupid ask but can somebody tell me from where can I get .fa file?

@engrasif09‌

Hi,

If you are working with human samples, we provide the reference builds. They are located in our resource bundle. Please read about the resource bundle here: http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server
Once you log in, you can click on bundle then 2.8 where you will find the latest builds.

-Sheila

• pakistanMember Posts: 9

hello All
java -jar CreateSequenceDictionary.jar \
REFERENCE=reference.fa \
OUTPUT=reference.dict
doesnt not work for me. How to give path to this command i am new to ubuntu. Thanks

Hi,

The best thing to do is google how to set a path. You need to find where exactly your picard folder is stored on your computer and specify that location in the command.

For example, my picard folder is in my Applications directory, so my path looks like this:
/Applications/picard/CreateSequenceDictionary.jar

-Sheila

• pakistanMember Posts: 9

Hello sheila,

I am facing this issue
muhammad@muhammad-7G-Series:~/Desktop/Gatk$/home/muhammad/Desktop java -jar CreateSequenceDictionary.jar \ REFERENCE=ucsc.hg19.fasta \OUTPUT=ucsc.hg19.dict bash: /home/muhammad/Desktop: Is a directory Picard folder is located on my desktop. Please help. • pakistanMember Posts: 9 Ok i have fixed it. Thanks shiela for being innovative. • Member Posts: 2 I've seem some people have similar issues but I can't fix that I get the error 'Unable to access jarfile CreateSequenceDictionary.jar'. I just spend several hours trying to add the GenomeAnalysisTK.jar file to my PATH and I think it worked because the test 'java -jar GenomeAnalysisTK.jar -h' shows something! So forgive my ignorance but I thought commands for that folder were accessible without having to specify complicated paths. Anyway I have opened the Picard folder in the GenomeAnalysisTK.jar file to and I don't have a CreateSequenceDictionary.jar folder at all - I thought I might just move it manually to the same directory as my reference.fa. Where am I going wrong?! • Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,456 admin @lcollopy Hi, CreateSequenceDictionary.jar is not part of GATK, it is part of Picard. You need to specify the path to your Picard folder, not your GATK folder. For example, my Picard folder is in my Applications folder, so my command looks like: java -jar /Applications/picard/CreateSequenceDictionary.jar -Sheila • Administrator, Dev Posts: 11,163 admin @lcollopy Part of your confusion may be because you're familiar with invoking self-contained programs (like BWA) that you can add to your path and simply call by name. With java programs it's different because you invoke java itself by name, but then you have to tell it which jar file you want to run. That you cannot simply call by name in the "usual" way; you have to specify the full path to the jar. But you can set up an environment variable to avoid retyping it each time. Geraldine Van der Auwera, PhD • CanadaMember Posts: 16 This the error which I got after folllowing a proper command: lindas-mac-pro:~ lindakohn$ java -jar /Users/lindakohn/desktop/tools/picard/CreateSequenceDictionary.jar \REFERENCE= Users/lindakohn/desktop/PD/geomyces_destructans_20631-21_1_supercontigs.fasta \OUTPUT=geomyces_destructans_20631-21_1_supercontigs.dict
[Sat Aug 01 17:48:36 EDT 2015] picard.sam.CreateSequenceDictionary REFERENCE=Users/lindakohn/desktop/PD/geomyces_destructans_20631-21_1_supercontigs.fasta OUTPUT=geomyces_destructans_20631-21_1_supercontigs.dict TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sat Aug 01 17:48:36 EDT 2015] Executing as lindakohn@lindas-mac-pro.utmfs.ads.utm.utoronto.ca on Mac OS X 10.10.3 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14; Picard version: 1.119(d44cdb51745f5e8075c826430a39d8a61f1dd832_1408991805) JdkDeflater
[Sat Aug 01 17:48:36 EDT 2015] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1029177344
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: geomyces_destructans_20631-21_1_supercontigs.fasta
at htsjdk.samtools.reference.FastaSequenceFile.(FastaSequenceFile.java:53)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:81)
at picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:136)
at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:121)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:97)
Caused by: java.io.FileNotFoundException: Users/lindakohn/desktop/PD/geomyces_destructans_20631-21_1_supercontigs.fasta (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
... 6 more

@Jeegar
Hi,

The error message tells me the fasta file is either not in the path you specified, or the directory does not exist. You will have to check which one is the problem.

-Sheila

• Seoul, KoreaMember Posts: 3

Hello

I have installed bwa-0.7.13 successfully and tried to index the UCSC reference genome hg19 but got a "can not allocate memory" error.
I have downloaded the reference file from the resources files here. I have searched google for solution but could not get anything specific.
I am using a virtual machine of ubuntu 14.04 with 4GB RAM and 60 GB hard disk space. Kindly help.

Issue · Github March 2016 by Sheila

Issue Number
735
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

@tc19
Hi,

Unfortunately, we cannot help you as you are running BWA, not GATK tools.

-Sheila

• Seoul, KoreaMember Posts: 3

Hi Sheila

Thank you for your response. I seem to have found the solution.

@tc19
Hi,

Great! If you would like to share your resolution here, perhaps your solution will help others on the forum.

-Sheila

• Seoul, KoreaMember Posts: 3

Hi
I found out from the BWA site that to index human genome BWA needs atleast 5GB memory. I was running it with 4GB RAM hence I was getting the "cannot allocate memory error". I increased my RAM to 5.5GB in the virtual machine and was able to index the hg19 reference gene.

• Silver Spring, MDMember Posts: 6

Hi,
I have a question about index of human reference genome using BWA. Is it ok to index the "zipped" fasta file of Human reference genome or one should ist unzip it and then index it using BWA ?.

(1). bwa index -a bwtsw human_g1k_v37.fasta.gz
OR
(2). bwa index -a bwtsw human_g1k_v37.fasta

Although BWA index both these ways but i want to know whether indexing zipped fasta file is OK or not ?. thanks - Ravi

• Silver Spring, MDMember Posts: 6

"Gzipped fasta files will not work with the GATK, so please make sure to unzip them first."

• CaliforniaMember Posts: 1

One of my colleagues was confused by the lack of angle brackets around "reference.fa". Without brackets it appears as a literal. It seems like a minor issue, but this is much more common and less confusing to CS folks:

bwa index -a bwtsw <reference.fa>

samtools faidx <reference.fa>

java -jar picard.jar CreateSequenceDictionary \
REFERENCE=<reference.fa> \
OUTPUT=<reference.dict>

@airuck The difficulty here is that we've got a very mixed audience; overall we believe we have more users who have more bio science than CS background, and understand the literal form better. If we include angle brackets we have many users who will type them into their command. If we don't, they typically understand that they need to edit the file name value to match what they are working with. So we cater more to that part of our audience, and assume that CS folks will figure it out reasonably quickly what our conventions are.

Geraldine Van der Auwera, PhD

• mghMember Posts: 1

You you have an equivalent for STAR. I am running PICARD to in preparation for running GATK and am running into errors.

Here for instance:

I am running into a problem with PICARD where by the validation of PICARD does not expect extended CIGAR operators. There is an S and and H in the offending CIGAR.

D64TDFP1:323:C83UKACXX:1:1309:1615:65600 99 chr1 71619089 60 50M = 71619128 59 TCCCGGCTTCGGTGCAGCGCACCGAGGTCCCCAGGCACAGGACTGCCAGC BBCFFFFFHHHHFHHIJIIJJJJJJJJHIJJJJJJJJJJJHHHHFFFFFD MD:Z:50 PG:Z:MarkDuplicates RG:Z:1 NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:69
D64TDFP1:323:C83UKACXX:1:1309:1615:65600 147 chr1 71619178 60 25H25S = 71619089 -59 GACTGACAGCCACCGGGGATTGAGG JJJJIJJJJJJJHHHHHFFFFFCC@ MD:Z:20 PG:Z:MarkDuplicates RG:Z:1 NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:69

Here is the error message when I try to index the split.bam in preparing to do
var calls.

[Tue Nov 22 02:34:21 EST 2016] picard.sam.BuildBamIndex INPUT=split.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Nov 22 02:34:21 EST 2016] Executing as bak22@cmu111.research.partners.org on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26; Picard version: 2.7.1-20-gf8a93c9-SNAPSHOT
[Tue Nov 22 02:34:25 EST 2016] picard.sam.BuildBamIndex done. Elapsed time: 0.08 minutes.
Runtime.totalMemory()=1417674752
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name D64TDFP1:323:C83UKACXX:1:1309:1615:65600, No real operator (M|I|D|N) in CIGAR
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:252)
at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:600)
at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1531)
at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2038)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:664) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:650)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:620) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:305)
at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:289)
at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:147)
1,9 Top

Issue · Github November 2016 by Sheila

Issue Number
1466
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Hi @dotkesner,

Your second in pair read, with CIGAR 25H25S, does not adhere to SAM specifications. Half the read is gone (hardclipped) and the other half is softclipped making none of the sequence aligned. The position field is meant to indicate the leftmost mapping base and in this case such a coordinate does not make sense.

Can you tell me are there many such odd reads in your BAM and are these intended by your aligner? Also, are you following the 2-pass method for RNA-Seq workflows?

Hi @dotkesner, I've talked to some folks here and apparently such odd alignments occur fairly frequently. My first suggestion is that you try to index with samtools index. My other less likely suggestion is to try Picard SortSam to index. I'm going to guess this latter tool will give the same error as BuildBamIndex but it is worth confirming this (if you don't mind). Let me know if one or both of these work for you.

edited November 2016

@dotkesner, It occurs to me that if you want to remove such problematic reads, you can also use GATK's PrintReads in --unsafe mode (so the tool will take an unindexed BAM). This allows you to utilize GATK Engine filters. For example, one of the default filters, BadCigarFilter, should remove such reads based on the "fully hard or soft clipped" criteria. This PrintReads step should generate an index but because you're running the tool in unsafe mode, just to be safe, I would regenerate the index in a separate step with the resulting file.

Also, take a look at the ValidateSamFile document.

Post edited by shlee on
@dotkesner More immediately, try relaxing the validation stringency parameter -- that will tell Picard not to flip out on reads that don't make sense but aren't actually broken. By default it's set to VALIDATION_STRINGENCY=STRICT but you can set it to LENIENT`, I believe.