The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

(howto) Prepare a reference for use with BWA and GATK

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,727 admin
edited August 2016 in Archive

NOTE: This tutorial has been replaced by a more recent version that uses GRCh38 that you can find here.


Objective

Prepare a reference sequence so that it is suitable for use with BWA and GATK.

Prerequisites

  • Installed BWA
  • Installed SAMTools
  • Installed Picard

Steps

  1. Generate the BWA index
  2. Generate the Fasta file index
  3. Generate the sequence dictionary

1. Generate the BWA index

Action

Run the following BWA command:

bwa index -a bwtsw reference.fa 

where -a bwtsw specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.

Expected Result

This creates a collection of files used by BWA to perform the alignment.


2. Generate the fasta file index

Action

Run the following SAMtools command:

samtools faidx reference.fa 

Expected Result

This creates a file called reference.fa.fai, with one record per line for each of the contigs in the FASTA reference file. Each record is composed of the contig name, size, location, basesPerLine and bytesPerLine.


3. Generate the sequence dictionary

Action

Run the following Picard command:

java -jar picard.jar CreateSequenceDictionary \
    REFERENCE=reference.fa \ 
    OUTPUT=reference.dict 

Note that this is the new syntax for use with the latest version of Picard. Older versions used a slightly different syntax because all the tools were in separate jars, so you'd call e.g. java -jar CreateSequenceDictionary.jar directly.

Expected Result

This creates a file called reference.dict formatted like a SAM header, describing the contents of your reference FASTA file.

Geraldine Van der Auwera, PhD

Post edited by shlee on
Tagged:

Comments

  • pdexheimerpdexheimer Member, Dev Posts: 544 ✭✭✭✭

    Hey Geraldine - I suspect that these howtos are still being written, but I wanted to point out a couple of issues with the BWA index step (#1 above): 1) The default mode in bwa index won't handle a mammalian-sized genome, you need to supply the -a bwtsw argument, and 2) I'm pretty sure that bwa index doesn't generate .fai files (though I haven't tested and could certainly be wrong)

  • KurtKurt Member Posts: 255 ✭✭✭

    @pdexheimer said, "I'm pretty sure that bwa index doesn't generate .fai files (though I haven't tested and could certainly be wrong)". This is true. Otherwise samtools faidx ref.fa would be redundant.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,727 admin

    Well spotted guys, thank you! I'll blame bad editing.

    Geraldine Van der Auwera, PhD

  • michaelchaomichaelchao Member Posts: 16
    edited March 2014

    Hi

    Do we still follow this process if we use the GATK's resource bundle? The bundle already contains:

    human_g1k_v37.fasta

    human_g1k_v37.dict

    human_g1k_v37.fasta.fai

    For the files provided in the bundle, do we just need to create the bwa indices for GRCh37 using the following command:

    bwa index human_g1k_v37.fasta

    or are there any additional preparations for the reference genome before the alignment stage?

    Thank you for your help

  • KurtKurt Member Posts: 255 ✭✭✭

    You can, just need to note the addition that @pdexheimer mentions above in regards to bwa index. On a more subtle note, indexes generated via bwa index -a bwtsw for pre/post bwa 0.6 versions are different (due to the coding that allows bwa indexing of genomes greater than 4Gb). In short, all you should have to do with any version of bwa greater than 0.5-9 is index the genome with bwa index -a bwtsw. You get the rest you need when you unzip the files that the GATK resource bundle provides.

  • engrasif09engrasif09 PakistanMember Posts: 6

    Looks like a stupid ask but can somebody tell me from where can I get .fa file?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,843 admin

    @engrasif09‌

    Hi,

    If you are working with human samples, we provide the reference builds. They are located in our resource bundle. Please read about the resource bundle here: http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server
    Once you log in, you can click on bundle then 2.8 where you will find the latest builds.

    -Sheila

  • sadiqsadiq pakistanMember Posts: 9

    hello All
    java -jar CreateSequenceDictionary.jar \
    REFERENCE=reference.fa \
    OUTPUT=reference.dict
    doesnt not work for me. How to give path to this command i am new to ubuntu. Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,843 admin

    @sadiq

    Hi,

    The best thing to do is google how to set a path. You need to find where exactly your picard folder is stored on your computer and specify that location in the command.

    For example, my picard folder is in my Applications directory, so my path looks like this:
    /Applications/picard/CreateSequenceDictionary.jar

    -Sheila

  • sadiqsadiq pakistanMember Posts: 9

    Hello sheila,

    Thank you for your abrupt reply.
    I am facing this issue
    muhammad@muhammad-7G-Series:~/Desktop/Gatk$ /home/muhammad/Desktop java -jar CreateSequenceDictionary.jar \ REFERENCE=ucsc.hg19.fasta \OUTPUT=ucsc.hg19.dict
    bash: /home/muhammad/Desktop: Is a directory
    Picard folder is located on my desktop.
    Please help.

  • sadiqsadiq pakistanMember Posts: 9

    Ok i have fixed it. Thanks shiela for being innovative.

  • lcollopylcollopy Member Posts: 2

    I've seem some people have similar issues but I can't fix that I get the error 'Unable to access jarfile CreateSequenceDictionary.jar'.

    I just spend several hours trying to add the GenomeAnalysisTK.jar file to my PATH and I think it worked because the test 'java -jar GenomeAnalysisTK.jar -h' shows something!

    So forgive my ignorance but I thought commands for that folder were accessible without having to specify complicated paths. Anyway I have opened the Picard folder in the GenomeAnalysisTK.jar file to and I don't have a CreateSequenceDictionary.jar folder at all - I thought I might just move it manually to the same directory as my reference.fa.

    Where am I going wrong?!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,843 admin

    @lcollopy
    Hi,

    CreateSequenceDictionary.jar is not part of GATK, it is part of Picard. You need to specify the path to your Picard folder, not your GATK folder. For example, my Picard folder is in my Applications folder, so my command looks like:

    java -jar /Applications/picard/CreateSequenceDictionary.jar

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,727 admin

    @lcollopy Part of your confusion may be because you're familiar with invoking self-contained programs (like BWA) that you can add to your path and simply call by name. With java programs it's different because you invoke java itself by name, but then you have to tell it which jar file you want to run. That you cannot simply call by name in the "usual" way; you have to specify the full path to the jar. But you can set up an environment variable to avoid retyping it each time.

    Geraldine Van der Auwera, PhD

  • JeegarJeegar CanadaMember Posts: 16

    This the error which I got after folllowing a proper command:

    lindas-mac-pro:~ lindakohn$ java -jar /Users/lindakohn/desktop/tools/picard/CreateSequenceDictionary.jar \REFERENCE= Users/lindakohn/desktop/PD/geomyces_destructans_20631-21_1_supercontigs.fasta \OUTPUT=geomyces_destructans_20631-21_1_supercontigs.dict
    [Sat Aug 01 17:48:36 EDT 2015] picard.sam.CreateSequenceDictionary REFERENCE=Users/lindakohn/desktop/PD/geomyces_destructans_20631-21_1_supercontigs.fasta OUTPUT=geomyces_destructans_20631-21_1_supercontigs.dict TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Sat Aug 01 17:48:36 EDT 2015] Executing as lindakohn@lindas-mac-pro.utmfs.ads.utm.utoronto.ca on Mac OS X 10.10.3 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14; Picard version: 1.119(d44cdb51745f5e8075c826430a39d8a61f1dd832_1408991805) JdkDeflater
    [Sat Aug 01 17:48:36 EDT 2015] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=1029177344
    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
    Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: geomyces_destructans_20631-21_1_supercontigs.fasta
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:427)
    at htsjdk.samtools.reference.FastaSequenceFile.(FastaSequenceFile.java:53)
    at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:81)
    at picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:136)
    at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:121)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
    at picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:97)
    Caused by: java.io.FileNotFoundException: Users/lindakohn/desktop/PD/geomyces_destructans_20631-21_1_supercontigs.fasta (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.(FileInputStream.java:138)
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:423)
    ... 6 more

  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,843 admin

    @Jeegar
    Hi,

    The error message tells me the fasta file is either not in the path you specified, or the directory does not exist. You will have to check which one is the problem.

    -Sheila

  • tc19tc19 Seoul, KoreaMember Posts: 3

    Hello

    I have installed bwa-0.7.13 successfully and tried to index the UCSC reference genome hg19 but got a "can not allocate memory" error.
    I have downloaded the reference file from the resources files here. I have searched google for solution but could not get anything specific.
    I am using a virtual machine of ubuntu 14.04 with 4GB RAM and 60 GB hard disk space. Kindly help.

    Issue · Github
    by Sheila

    Issue Number
    735
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,843 admin

    @tc19
    Hi,

    Unfortunately, we cannot help you as you are running BWA, not GATK tools.

    -Sheila

  • tc19tc19 Seoul, KoreaMember Posts: 3

    Hi Sheila

    Thank you for your response. I seem to have found the solution.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,843 admin

    @tc19
    Hi,

    Great! If you would like to share your resolution here, perhaps your solution will help others on the forum.

    -Sheila

  • tc19tc19 Seoul, KoreaMember Posts: 3

    Hi
    I found out from the BWA site that to index human genome BWA needs atleast 5GB memory. I was running it with 4GB RAM hence I was getting the "cannot allocate memory error". I increased my RAM to 5.5GB in the virtual machine and was able to index the hg19 reference gene.

  • RaviKumarSindhuRaviKumarSindhu Silver Spring, MDMember Posts: 6

    Hi,
    I have a question about index of human reference genome using BWA. Is it ok to index the "zipped" fasta file of Human reference genome or one should ist unzip it and then index it using BWA ?.

    (1). bwa index -a bwtsw human_g1k_v37.fasta.gz
    OR
    (2). bwa index -a bwtsw human_g1k_v37.fasta

    Although BWA index both these ways but i want to know whether indexing zipped fasta file is OK or not ?. thanks - Ravi

  • RaviKumarSindhuRaviKumarSindhu Silver Spring, MDMember Posts: 6

    "Gzipped fasta files will not work with the GATK, so please make sure to unzip them first."

  • airuckairuck CaliforniaMember Posts: 1

    One of my colleagues was confused by the lack of angle brackets around "reference.fa". Without brackets it appears as a literal. It seems like a minor issue, but this is much more common and less confusing to CS folks:

    bwa index -a bwtsw <reference.fa>

    samtools faidx <reference.fa>

    java -jar picard.jar CreateSequenceDictionary \
    REFERENCE=<reference.fa> \
    OUTPUT=<reference.dict>

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,727 admin

    @airuck The difficulty here is that we've got a very mixed audience; overall we believe we have more users who have more bio science than CS background, and understand the literal form better. If we include angle brackets we have many users who will type them into their command. If we don't, they typically understand that they need to edit the file name value to match what they are working with. So we cater more to that part of our audience, and assume that CS folks will figure it out reasonably quickly what our conventions are.

    Geraldine Van der Auwera, PhD

  • dotkesnerdotkesner mghMember Posts: 1

    You you have an equivalent for STAR. I am running PICARD to in preparation for running GATK and am running into errors.

    Here for instance:

    I am running into a problem with PICARD where by the validation of PICARD does not expect extended CIGAR operators. There is an S and and H in the offending CIGAR.

    HERE are the offending reads:

    D64TDFP1:323:C83UKACXX:1:1309:1615:65600 99 chr1 71619089 60 50M = 71619128 59 TCCCGGCTTCGGTGCAGCGCACCGAGGTCCCCAGGCACAGGACTGCCAGC BBCFFFFFHHHHFHHIJIIJJJJJJJJHIJJJJJJJJJJJHHHHFFFFFD MD:Z:50 PG:Z:MarkDuplicates RG:Z:1 NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:69
    D64TDFP1:323:C83UKACXX:1:1309:1615:65600 147 chr1 71619178 60 25H25S = 71619089 -59 GACTGACAGCCACCGGGGATTGAGG JJJJIJJJJJJJHHHHHFFFFFCC@ MD:Z:20 PG:Z:MarkDuplicates RG:Z:1 NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:69


    Here is the error message when I try to index the split.bam in preparing to do
    var calls.

    [Tue Nov 22 02:34:21 EST 2016] picard.sam.BuildBamIndex INPUT=split.bam VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Tue Nov 22 02:34:21 EST 2016] Executing as bak22@cmu111.research.partners.org on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26; Picard version: 2.7.1-20-gf8a93c9-SNAPSHOT
    [Tue Nov 22 02:34:25 EST 2016] picard.sam.BuildBamIndex done. Elapsed time: 0.08 minutes.
    Runtime.totalMemory()=1417674752
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name D64TDFP1:323:C83UKACXX:1:1309:1615:65600, No real operator (M|I|D|N) in CIGAR
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
    at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:252)
    at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:600)
    at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1531)
    at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2038)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:664)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:650)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:620)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:305)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:289)
    at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:147)
    1,9 Top

    Issue · Github
    by Sheila

    Issue Number
    1466
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 585 admin

    Hi @dotkesner,

    Your second in pair read, with CIGAR 25H25S, does not adhere to SAM specifications. Half the read is gone (hardclipped) and the other half is softclipped making none of the sequence aligned. The position field is meant to indicate the leftmost mapping base and in this case such a coordinate does not make sense.

    I would think such a read should instead be unmapped.

    Can you tell me are there many such odd reads in your BAM and are these intended by your aligner? Also, are you following the 2-pass method for RNA-Seq workflows?

  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 585 admin

    Hi @dotkesner, I've talked to some folks here and apparently such odd alignments occur fairly frequently. My first suggestion is that you try to index with samtools index. My other less likely suggestion is to try Picard SortSam to index. I'm going to guess this latter tool will give the same error as BuildBamIndex but it is worth confirming this (if you don't mind). Let me know if one or both of these work for you.

  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 585 admin
    edited November 2016

    @dotkesner, It occurs to me that if you want to remove such problematic reads, you can also use GATK's PrintReads in --unsafe mode (so the tool will take an unindexed BAM). This allows you to utilize GATK Engine filters. For example, one of the default filters, BadCigarFilter, should remove such reads based on the "fully hard or soft clipped" criteria. This PrintReads step should generate an index but because you're running the tool in unsafe mode, just to be safe, I would regenerate the index in a separate step with the resulting file.

    Also, take a look at the ValidateSamFile document.

    Post edited by shlee on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,727 admin

    @dotkesner More immediately, try relaxing the validation stringency parameter -- that will tell Picard not to flip out on reads that don't make sense but aren't actually broken. By default it's set to VALIDATION_STRINGENCY=STRICT but you can set it to LENIENT, I believe.

    This is safer than trying to use GATK in unsafe mode.

    Geraldine Van der Auwera, PhD

This discussion has been closed.