FASTA - Reference genome format
The GATK requires the reference sequence to be provided as a single file in FASTA format, with all contigs in the same file, validated according to the FASTA standard. All the standard IUPAC bases are accepted, but bases other than A, C, G and T (such as W, for example) will be ignored, meaning those positions in the genome will be skipped. Note also that because commonly used programs such as Picard and Samtools treat spaces in contig names differently, we recommend that you avoid using spaces in contig names if you're making your own genome reference.
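If you are building your own reference, a quick scan of the header lines can flag contig names that contain spaces. The following is only a minimal sketch: the function name contigs_with_spaces is hypothetical, and it treats everything after the ">" on a header line as the contig name, which is a simplification.

```python
def contigs_with_spaces(fasta_lines):
    """Return the header texts (without '>') that contain a space."""
    flagged = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the whole header text is treated as the contig name.
            name = line[1:].rstrip("\r\n")
            if " " in name:
                flagged.append(name)
    return flagged


# Usage with an in-memory example; a real file object can be passed the same way.
print(contigs_with_spaces([">chr 1\n", "ACGT\n", ">chr2\n", "GGCC\n"]))  # prints: ['chr 1']
```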
Most GATK tools additionally require that the main FASTA file be accompanied by a dictionary file ending in .dict and an index file ending in .fai, because these files allow efficient random access to the reference bases. GATK will look for these files based on their names, so it's important that they have the same basename as the FASTA file. If you don't have these files available for your organism's reference file, you can generate them very easily; instructions are included below.
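As an illustration of the naming convention, the sketch below derives the companion file names for a given FASTA file and reports which ones are missing. It assumes the pattern used in the examples in this document (ref.fasta goes with ref.dict and ref.fasta.fai); the function names are hypothetical, and the suffix handling would need adjusting for names such as ref.fa.gz.

```python
from pathlib import Path


def companion_files(fasta_path):
    """Return the (.dict, .fai) paths expected next to a FASTA file."""
    fasta = Path(fasta_path)
    # Dictionary: the FASTA extension is replaced, e.g. ref.fasta -> ref.dict
    dict_path = fasta.with_suffix(".dict")
    # Index: ".fai" is appended to the full file name, e.g. ref.fasta -> ref.fasta.fai
    fai_path = fasta.with_name(fasta.name + ".fai")
    return dict_path, fai_path


def missing_companions(fasta_path):
    """List the expected companion files that do not exist on disk."""
    return [path for path in companion_files(fasta_path) if not path.exists()]
```

Calling missing_companions("ref.fasta") returns an empty list when both files are present, and otherwise the paths that still need to be generated.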
If you are working with human data, we recommend you use one of the reference genome builds that we provide in our Resource Bundle or in FireCloud, our cloud-based analysis portal. We currently support GRCh38/hg38 and b37 (and to a lesser extent, hg19). For more information on the human genome reference builds, see this document.
Common problems with reference files
The most common reference-related issue people encounter is an incompatibility between some of the data and/or resources that were derived from (or mapped to) different reference builds. Read more about that problem and how to solve it in this Solutions doc.
Some people have also reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.
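To locate this kind of corruption before cleaning up a file, you can scan the sequence lines for anything that is not a base code. This is a rough sketch, not a full FASTA validator: the function name find_bad_characters is hypothetical, and the set of allowed characters is an assumption (the IUPAC nucleotide letters in either case), so adjust it if your reference legitimately uses other symbols.

```python
# Assumption: the IUPAC nucleotide letters are the only valid sequence characters.
IUPAC = set("ACGTURYKMSWBDHVN")


def find_bad_characters(path, max_report=10):
    """Return up to max_report (line_number, character) pairs found in sequence lines."""
    problems = []
    # newline="" keeps carriage returns visible instead of translating them away.
    with open(path, "r", newline="", encoding="latin-1") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith(">"):
                continue  # header lines are not checked here
            for char in line.rstrip("\n"):
                if char.upper() not in IUPAC:
                    problems.append((line_number, char))
                    if len(problems) >= max_report:
                        return problems
    return problems
```

Digits such as "1" and "0" and stray carriage returns ("\r") would both show up in the returned list, together with the line they occur on.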
Instructions for generating the dictionary and index files
Creating the FASTA sequence dictionary file
We use the CreateSequenceDictionary tool to create a .dict file from a FASTA file. Note that we only specify the input reference; the tool names the output file automatically.
gatk-launch CreateSequenceDictionary -R ref.fasta
This produces a SAM-style header file named ref.dict describing the contents of our FASTA file.
@HD VN:1.5
@SQ SN:20 LN:63025520 M5:0dec9660ec1efaaf33281c0d5ea2560f UR:file:/Users/vdauwera/Desktop/germline_mini/ref/ref.fasta
Here we are using a tiny reference file with a single contig, chromosome 20 from the human b37 reference genome, that we use for demo purposes. If we were running on the full human reference genome there would be many more contigs listed.
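If you want to inspect a dictionary programmatically, each @SQ line can be split into its tags. The sketch below is illustrative only: parse_sq_line is a hypothetical helper, and it relies on the fact that SAM header fields are separated by tabs in the actual file (they appear as spaces in the listing above).

```python
def parse_sq_line(line):
    """Split one @SQ header line into a {tag: value} dictionary."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ line")
    record = {}
    for field in fields[1:]:
        # Split only once: values such as UR contain colons themselves.
        tag, value = field.split(":", 1)
        record[tag] = value
    return record


example = (
    "@SQ\tSN:20\tLN:63025520\tM5:0dec9660ec1efaaf33281c0d5ea2560f"
    "\tUR:file:/Users/vdauwera/Desktop/germline_mini/ref/ref.fasta"
)
record = parse_sq_line(example)
print(record["SN"], int(record["LN"]))  # prints: 20 63025520
```

Here SN is the contig name, LN its length in bases, M5 a checksum of the sequence, and UR the location of the FASTA file the dictionary was built from.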
Creating the FASTA index file
We use the faidx command in Samtools to prepare the FASTA index file. This file describes byte offsets in the FASTA file for each contig, allowing us to compute exactly where to find a particular reference base at specific genomic coordinates in the FASTA file.
samtools faidx ref.fasta
This produces a text file named ref.fasta.fai with one record per line for each of the FASTA contigs. Each record consists of the contig name, size, location, basesPerLine and bytesPerLine. The index file produced above looks like this:
20 63025520 4 60 61
This shows that our FASTA file contains chromosome 20, which is 63025520 bases long; the remaining fields are coordinates within the file, which you don't need to care about.
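For the curious, here is how the location, basesPerLine and bytesPerLine fields can be combined to find a base in the file. This is a minimal sketch with hypothetical function names; it assumes every sequence line of the contig (except possibly the last) has the same length, and that the contig name contains no whitespace.

```python
def parse_fai_line(line):
    """Parse one .fai record into (name, size, location, bases_per_line, bytes_per_line)."""
    # Assumption: the contig name contains no whitespace, so a plain split is enough.
    name, size, location, bases_per_line, bytes_per_line = line.split()
    return name, int(size), int(location), int(bases_per_line), int(bytes_per_line)


def byte_offset(position, location, bases_per_line, bytes_per_line):
    """Byte offset in the FASTA file of a 1-based position on a contig."""
    full_lines, within_line = divmod(position - 1, bases_per_line)
    # Each full line costs bytes_per_line bytes (the bases plus the line ending).
    return location + full_lines * bytes_per_line + within_line


name, size, location, bases_per_line, bytes_per_line = parse_fai_line("20 63025520 4 60 61")
print(byte_offset(1, location, bases_per_line, bytes_per_line))   # prints: 4
print(byte_offset(61, location, bases_per_line, bytes_per_line))  # prints: 65
```

With the record shown above, the header line ">20" and its newline occupy bytes 0 to 3, so the first base sits at byte 4. Position 61 is the first base of the second sequence line, which starts after one full 61-byte line (60 bases plus a newline), giving byte 65.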