
How can I prepare a FASTA file to use as reference?

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited September 2013 in FAQs

This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant tutorial.

Why these steps are necessary

The GATK uses two files to access the reference and to safety-check that access: a .dict dictionary of the contig names and sizes, and a .fai fasta index file that allows efficient random access to the reference bases. You have to generate these files before you can use a fasta file as reference.
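As a quick pre-flight check, the sketch below derives where the two companion files are expected and reports which ones are missing. The naming rule is an assumption inferred from the error messages quoted in the comments on this page (`Oryza_glaberrima.fa` paired with `Oryza_glaberrima.dict`, and `combined_ref.fasta` paired with `combined_ref.fasta.fai`); the function names are hypothetical and not part of GATK.

```python
from pathlib import Path


def companion_files(fasta):
    """Return the (.dict, .fai) paths expected next to a fasta reference.

    Assumed convention: the .dict replaces the fasta extension,
    while the .fai is appended to the full fasta file name.
    """
    fasta = Path(fasta)
    dict_path = fasta.with_suffix(".dict")           # ref.fasta -> ref.dict
    fai_path = fasta.with_name(fasta.name + ".fai")  # ref.fasta -> ref.fasta.fai
    return dict_path, fai_path


def missing_companions(fasta):
    """List the companion files that do not exist on disk."""
    return [path for path in companion_files(fasta) if not path.is_file()]
```

Calling `missing_companions("Homo_sapiens_assembly18.fasta")` before launching a GATK job returns an empty list when both files sit in the same directory as the reference.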

NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.
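To find out whether a fasta file is affected, one option is to scan its header lines for whitespace. This is only a minimal sketch: the function name is hypothetical, and it flags any whitespace after the `>` without trying to reproduce how Picard or samtools would each interpret the name.

```python
def headers_with_whitespace(fasta_lines):
    """Yield (line_number, header) for fasta header lines that contain whitespace."""
    for number, line in enumerate(fasta_lines, start=1):
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            if any(ch.isspace() for ch in header):
                yield number, header
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, for example `with open("my_reference.fasta") as handle: print(list(headers_with_whitespace(handle)))`, where the file name is a placeholder.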

Creating the fasta sequence dictionary file

We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.

> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
44.922u 2.308s 0:47.09 100.2%   0+0k 0+0io 2pf+0w

This produces a SAM-style header file describing the contents of our fasta file.

> cat Homo_sapiens_assembly18.dict 
@HD     VN:1.0  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:247249719    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ     SN:chr2 LN:242951149    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:b12c7373e3882120332983be99aeb18d
@SQ     SN:chr3 LN:199501827    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ     SN:chr4 LN:191273063    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cf37020337904229dca8401907b626c2
@SQ     SN:chr5 LN:180857866    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:031c851664e31b2c17337fd6f9004858
@SQ     SN:chr6 LN:170899992    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfe8005c536131276d448ead33f1b583
@SQ     SN:chr7 LN:158821424    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:74239c5ceee3b28f0038123d958114cb
@SQ     SN:chr8 LN:146274826    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ     SN:chr9 LN:140273252    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:ea244473e525dde0393d353ef94f974b
@SQ     SN:chr10        LN:135374737    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ     SN:chr11        LN:134452384    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ     SN:chr12        LN:132349534    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d17d70060c56b4578fa570117bf19716
@SQ     SN:chr13        LN:114142980    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c4f3084a20380a373bbbdb9ae30da587
@SQ     SN:chr14        LN:106368585    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ     SN:chr15        LN:100338915    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ     SN:chr16        LN:88827254     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3e81884229e8dc6b7f258169ec8da246
@SQ     SN:chr17        LN:78774742     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2a5c95ed99c5298bb107f313c7044588
@SQ     SN:chr18        LN:76117153     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3d11df432bcdc1407835d5ef2ce62634
@SQ     SN:chr19        LN:63811651     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2f1a59077cfad51df907ac25723bff28
@SQ     SN:chr20        LN:62435964     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ     SN:chr21        LN:46944323     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ     SN:chr22        LN:49691432     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2041e6a0c914b48dd537922cca63acb8
@SQ     SN:chrX LN:154913754    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ     SN:chrY LN:57772954     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ     SN:chr1_random  LN:1663265      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cc05cb1554258add2eb62e88c0746394
@SQ     SN:chr2_random  LN:185571       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ     SN:chr3_random  LN:749256       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cc571e918ac18afa0b2053262cadab6
@SQ     SN:chr4_random  LN:842648       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cab2949ccf26ee0f69a875412c93740
@SQ     SN:chr5_random  LN:143687       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:05926bdbff978d4a0906862eb3f773d0
@SQ     SN:chr6_random  LN:1875562      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d62eb2919ba7b9c1d382c011c5218094
@SQ     SN:chr7_random  LN:549659       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ     SN:chr8_random  LN:943810       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ     SN:chr9_random  LN:1146434      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ     SN:chr10_random LN:113275       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:50be2d2c6720dabeff497ffb53189daa
@SQ     SN:chr11_random LN:215294       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfc93adc30c621d5c83eee3f0d841624
@SQ     SN:chr13_random LN:186858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:563531689f3dbd691331fd6c5730a88b
@SQ     SN:chr15_random LN:784346       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bf885e99940d2d439d83eba791804a48
@SQ     SN:chr16_random LN:105485       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ     SN:chr17_random LN:2617613      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ     SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ     SN:chr19_random LN:301858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:420ce95da035386cc8c63094288c49e2
@SQ     SN:chr21_random LN:1679693      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:a7252115bfe5bb5525f34d039eecd096
@SQ     SN:chr22_random LN:257318       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4f2d259b82f7647d3b668063cf18378b
@SQ     SN:chrX_random  LN:1719168      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f4d71e0758986c15e5455bf3e14e5d6f
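To inspect a dictionary programmatically, the @SQ lines can be read into (name, length) pairs. The sketch below assumes the fields are tab-separated, as in SAM headers (the listing above may render the tabs as spaces); the function name is hypothetical.

```python
def parse_sequence_dictionary(lines):
    """Return a list of (contig_name, length) taken from the @SQ lines of a .dict file."""
    contigs = []
    for line in lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header records
        # Each field looks like TAG:value; split on the first colon only,
        # because values such as UR:file:/... contain colons themselves.
        fields = dict(
            field.split(":", 1) for field in line.rstrip("\r\n").split("\t")[1:]
        )
        contigs.append((fields["SN"], int(fields["LN"])))
    return contigs
```

The order of the returned list is the contig order of the reference, which is the order that other input files are expected to follow.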

Creating the fasta index file

We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.

> samtools faidx Homo_sapiens_assembly18.fasta 
108.446u 3.384s 2:44.61 67.9%   0+0k 0+0io 0pf+0w

This produces a text file with one record per line for each of the fasta contigs. Each record is of the form: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:

> cat Homo_sapiens_assembly18.fasta.fai 
chrM    16571   6       50      51
chr1    247249719       16915   50      51
chr2    242951149       252211635       50      51
chr3    199501827       500021813       50      51
chr4    191273063       703513683       50      51
chr5    180857866       898612214       50      51
chr6    170899992       1083087244      50      51
chr7    158821424       1257405242      50      51
chr8    146274826       1419403101      50      51
chr9    140273252       1568603430      50      51
chr10   135374737       1711682155      50      51
chr11   134452384       1849764394      50      51
chr12   132349534       1986905833      50      51
chr13   114142980       2121902365      50      51
chr14   106368585       2238328212      50      51
chr15   100338915       2346824176      50      51
chr16   88827254        2449169877      50      51
chr17   78774742        2539773684      50      51
chr18   76117153        2620123928      50      51
chr19   63811651        2697763432      50      51
chr20   62435964        2762851324      50      51
chr21   46944323        2826536015      50      51
chr22   49691432        2874419232      50      51
chrX    154913754       2925104499      50      51
chrY    57772954        3083116535      50      51
chr1_random     1663265 3142044962      50      51
chr2_random     185571  3143741506      50      51
chr3_random     749256  3143930802      50      51
chr4_random     842648  3144695057      50      51
chr5_random     143687  3145554571      50      51
chr6_random     1875562 3145701145      50      51
chr7_random     549659  3147614232      50      51
chr8_random     943810  3148174898      50      51
chr9_random     1146434 3149137598      50      51
chr10_random    113275  3150306975      50      51
chr11_random    215294  3150422530      50      51
chr13_random    186858  3150642144      50      51
chr15_random    784346  3150832754      50      51
chr16_random    105485  3151632801      50      51
chr17_random    2617613 3151740410      50      51
chr18_random    4262    3154410390      50      51
chr19_random    301858  3154414752      50      51
chr21_random    1679693 3154722662      50      51
chr22_random    257318  3156435963      50      51
chrX_random     1719168 3156698441      50      51
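
The arithmetic behind the index can be sketched as follows, using the five fields described above and assuming 1-based positions within a contig. The function names are hypothetical; this illustrates the calculation and is not the code samtools or GATK use.

```python
def parse_fai_record(line):
    """Split one .fai line into (contig, size, location, bases_per_line, bytes_per_line)."""
    contig, size, location, bases_per_line, bytes_per_line = line.split()
    return contig, int(size), int(location), int(bases_per_line), int(bytes_per_line)


def byte_offset(record, pos):
    """Byte offset in the fasta file of the 1-based position `pos` on the contig."""
    contig, size, location, bases_per_line, bytes_per_line = record
    if not 1 <= pos <= size:
        raise ValueError(f"{contig}:{pos} is outside 1..{size}")
    # Number of complete sequence lines before the base, and its column on its own line.
    full_lines, column = divmod(pos - 1, bases_per_line)
    return location + full_lines * bytes_per_line + column


chr1 = parse_fai_record("chr1    247249719       16915   50      51")
print(byte_offset(chr1, 101))  # two full 51-byte lines past the start: 16915 + 102
```

Each full line contributes bytesPerLine bytes (50 bases plus the newline), which is why the two line-length columns are both needed.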
    Hi. I generated a fasta file (sequences of a gene) and followed what you say. Finally, I did get the whole pipeline through (got the vcf file). However, I lost the coordinates. This is the fasta header:

    >1 dna:chromosome chromosome:GRCh37:1:196621008:196716634:1

    I loaded it into IGV, and the reads started from coordinate 1 instead of 196621008. Am I missing something? I think so, but I could not find the answer on Google. Has anyone had a similar problem?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    If you generated a custom reference with the sequence of just your gene, then this is normal. All the position counting will be done from the start of the sequence in the file, not from the original coordinates of the gene in the genome. If you want the calls in the VCF to have the true genome position coordinates, you should call them using the full genome of your organism. Otherwise you can simply calculate what they should be by adding the call position to the original start position of your gene in the genome. Make sense?
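
The coordinate shift described in this reply can be sketched as below. It assumes both the call position and the gene start are 1-based, in which case one is subtracted after adding them; the function name is hypothetical, and the start and end values come from the fasta header quoted in the question.

```python
def to_genome_position(call_pos, gene_start):
    """Map a 1-based position on the extracted gene to a 1-based genome position."""
    return gene_start + call_pos - 1


# The first base of the custom reference maps back to the gene start itself.
print(to_genome_position(1, 196621008))
```

Under the same assumption, the last base of the extracted region (position 95627) maps to 196716634, the end coordinate given in the header.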

  • Thank you for the reply. It is quite helpful.

  • @weihua said:
    Thank you for the reply. It is quite helpful.

    And I assume that, without coordinate data, I cannot do local realignment (or any procedure that involves coordinates) using the files in the bundle.

  • SophiaSophia Member

    What can be done about references containing N's? Can they be used in GATK, e.g. with the variant calling and variant annotation walkers?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    N's should be fine, they will just be skipped.

  • frankibfrankib Sherbrooke, CanadaMember

    Once you have created the two files (.dict and .fai), which one do you input in the command line to use the RealignerTargetCreator?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @frankib, neither of those files need to be specified in the command line. Please see the example commands given in the documentation for the tools you want to run.

  • frankibfrankib Sherbrooke, CanadaMember

    Ok Thank you.

  • Hello! Please could you tell me how to get a sorted *.dict for my mm9.fasta reference file? I got it from
    Thanks a lot in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @timwartewig, you can do it with Picard tools as described above. For more details, please see the Tutorials section of the Guide, or check out the Picard project documentation website.

  • Thank you Geraldine. Sorry that I had not written what I had already tried. I used Picard: java -jar CreateSequenceDictionary.jar R=mm9.fa O=mm9.dict, which produced the dict file. My bam file headers/contigs are sorted with Picard SortSam followed by samtools reheader. However, this is the error: ##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.

    ERROR reads contigs = [chr10, chr11, chr12, chr13, chr13_random, chr14, chr15, chr16, chr16_random, chr17, chr17_random, chr18, chr19, chr1, chr1_random, chr2, chr3, chr3_random, chr4, chr4_random, chr5, chr5_random, chr6, chr7, chr7_random, chr8, chr8_random, chr9, chr9_random, chrM, chrUn_random, chrX, chrX_random, chrY, chrY_random]
    ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chrX, chrY, chrM, chr13_random, chr16_random, chr17_random, chr1_random, chr3_random, chr4_random, chr5_random, chr7_random, chr8_random, chr9_random, chrUn_random, chrX_random, chrY_random]
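
The condition named in this error can be illustrated with a short sketch: contigs present in both lists must appear in the same relative order. This is not GATK's implementation, only a minimal check written from the wording of the message, and the function name is hypothetical.

```python
def same_relative_order(reads_contigs, reference_contigs):
    """Return True if the contigs shared by both lists appear in the same relative order."""
    shared = set(reads_contigs) & set(reference_contigs)
    reads_order = [c for c in reads_contigs if c in shared]
    reference_order = [c for c in reference_contigs if c in shared]
    return reads_order == reference_order


# A lexicographically sorted BAM header versus a karyotypically ordered reference.
print(same_relative_order(["chr10", "chr11", "chr1", "chr2"],
                          ["chr1", "chr2", "chr10", "chr11"]))
```

For the two lists in the error above the check fails, because chr10 precedes chr1 in the reads but follows it in the reference.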
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh I see. Picard's ReorderSam should fix that for you, see

  • frankibfrankib Sherbrooke, CanadaMember

    I don't understand why I'm able to run the CreateSequenceDictionary without problem but when I run the faidx tool I got the following error:

    open: No such file or directory
    [_razf_open] fail to open human_g1K_v37.fasta
    [fai_build] fail to open the FASTA file human_g1K_v37.fasta

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @frankib, what's your command line?

  • @Geraldine_VdAuwera said:
    frankib, neither of those files need to be specified in the command line. Please see the example commands given in the documentation for the tools you want to run.

    If neither .fai nor .dict is needed in the command line, why did we generate them in the first place?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @huilin Those files are needed by the tools. You don't write them in the command line because GATK automatically finds them.

  • Hi,
    I am trying to use GATK to convert hapmap#28 release data to VCF. I performed the steps you mention here and checked that my files look like the examples. Then I wrote the command the same way as on the site, but it gives the error: I/O error loading or writing tribble index file
    The files are in text format, and it seems the program tries to make an index of the input data but can't. My command: java -jar /softw/GenomeAnalysisTK.jar -T VariantsToVCF -R /mnt/NAS/share/gatk_bundle/2.8/hg18/Homo_sapiens_assembly18.fasta -o output.vcf --variant:RawHapMap /mnt/NAS/projects/2015_ayshin_sift/HAPMAP/hapmap#28/genotypes_chr8_CHB_r28_nr.b36_fwd.txt

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    It looks like the --variant argument cannot accept text files as input. Have a look at the documentation for acceptable file formats:


  • Thank you @Sheila. I looked at that part; it is the same, just in gzip form. I tried the command with genotypes_chr1_ASW_r27_nr.b36_fwd.txt.gz (in gz form), and it gives an error saying: an index is required, but not found. Does this mean that I should use samtools to make an index for the input file, the way I did for the reference?
    thank you

  • Hi again,
    I have a problem using hapmap data as input. I use the RawHapMap in gzip form as shown in the site :
    It gives this error: An index is required, but none found. How am I supposed to make an index of the hapmap data?
    thank you

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    You can use Tabix to generate .gz file indices.


  • ok, I will try it now
    thank you

  • d3abb7c9d3abb7c9 Member
    When I run this command:

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R /genomes/glaberrima/Oryza_glaberrima.fa -nt 24 -I Oryza_glaberrima-deduped.bam -o target_intervals.list

    I get this error message:

    ##### ERROR MESSAGE: Fasta dict file /genomes/glaberrima/Oryza_glaberrima.dict for reference /genomes/glaberrima/Oryza_glaberrima.fa does not exist.

    I have created the dictionary file from the fasta and it is in my current working directory but GATK is looking for it in the /genomes/ directory. Can you make it so that you can specify the location of the dictionary on the command line? Or alternatively you could make it so that it looks for the dictionary in the current working directory

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    Unfortunately, the .dict file has to be in the same directory as the .fa file.


  • Thanks for your reply. Our genomes are stored in /genomes so that they're not duplicated in everyone's /home directory wasting space. The /genomes directory is not user writable. Forcing the .dict file to be in the same directory as the .fa seems pretty inflexible. I hope you would consider changing this.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @d3abb7c9 That's not up to us -- this functionality comes from the htsjdk library, which is a project that falls under the samtools organization.

    You can solve this problem easily by creating a sequence dictionary in your /genomes directory so that it will be available for all users of your system. If you don't have admin rights to this directory, just let your sysadmin know that the .dict file is required for analysis and should be included in that directory. This is not an exotic requirement; other tools also make use of the sequence dictionary.

  • mbxat1mbxat1 NottinghamMember


    I tried to run RealignerTargetCreator but encountered this error message (Fasta dict file /home/mbxat1/African.cattle.project/Mapping/ for reference /home/mbxat1/African.cattle.project/Mapping/ does not exist). My .fa and .dict files are in the same directory, so I don't know what could be wrong. Your help will be appreciated.

    Thank you

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    Can you tell me which version of GATK you are using? Also, please post your exact command line.


  • mbxat1mbxat1 NottinghamMember

    @ Sheila, thank you for your response.

    I am using the current GATK version (GenomeAnalysisTK-3.4-0). my command line as follows:

    java -d64 -Xmx48g -jar ${GATK}/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/mbxat1/African.cattle.project/Bos_taurus.UMD3.1.dna.toplevel.fa -I ${DATOUT}/SampleKN002_mkdup.bam -o ${DATOUT}/SampleKN002_mkdup_intervals.list --filter_mismatching_base_and_quals --fix_misencoded_quality_scores -nt 4 2> >(tee "$logfile")

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Are you sure the dict file you have follows the required naming convention? It should have exactly the same name as the name mentioned in the error. If no, change the name to that.

    If yes, the other possibility is that the file is somehow damaged. Just delete it and create a new one.

  • mbxat1mbxat1 NottinghamMember

    @ Geraldine,

    thank you for your response. i might have been able to overcome the initial problem, i had to delete my reference file and download a new one from
    the error now is (Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool)

    any ideas please?


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    The mis-encoded base qualities error has been covered many times in the forum... Are you using the --fix_misencoded_quality_scores argument for a specific reason? Or is this a command you inherited from someone else?

  • mbxat1mbxat1 NottinghamMember

    You are right Geraldine, I inherited the command, but I have removed the "--fix_misencoded_quality_scores" argument and I am able to proceed without any error.

    Thank you for your time.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    My advice is to always check the purpose of every argument when someone else gives you commands, to avoid problems like this, or version-related problems. Trust no one ;)

  • mbxat1mbxat1 NottinghamMember

    well noted, thank you

  • carrigjcarrigj DublinMember

    Hi, I keep getting this error and i don't know why, any help would be greatly appreciated
    " ERROR MESSAGE: Fasta index file /Users/joannecarrig1/Fabianii/combined_ref.fasta.fai for reference /Users/joannecarrig1/Fabianii/combined_ref.fasta does not exist."

    note the fasta file was indexed

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @carrigj Make sure the fasta index is in the same directory.

  • sumedhagargsumedhagarg CambridgeMember

    I have managed to create .fai file for my ref sequence but not .dict file, despite samtools running the command. What could be going wrong?

    [Attached image: cmd screenshot.png]
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi Sumedha,

    You will need to use Picard's CreateSequenceDictionary.


  • sumedhagargsumedhagarg CambridgeMember

    Thanks Sheila. Yes, it worked for a small reference file, but I am getting an error for a much bigger file, a whole human gDNA fasta file, at a particular line, as attached.

    [Attached image: CreateSeqDict error.png]
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi Sumedha,

    It looks like an issue with your index file. Can you try deleting it and re-indexing the reference?


  • sumedhagargsumedhagarg CambridgeMember


    Hi Sheila
    I tried that but had the same error again. My reference file is from ensembl ( Is there way to address this issue?

    Thanks again!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi Sumedha,

    Can you try unzipping the FASTA file? Maybe the issue is that the tools are not working with .gz files.


  • sumedhagargsumedhagarg CambridgeMember


    I am using unzipped file already.

  • sumedhagargsumedhagarg CambridgeMember

    @Sheila @Geraldine_VdAuwera
    Would it be possible for you to index this file for me so that I could download it? I am really stuck, as I can't proceed until I have it working.
    Many thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Sorry, we can't provide that level of support for a reference file that we didn't produce ourselves.

    Try doing both the indexing and dictionary creation using Picard tools instead of samtools.
  • sumedhagargsumedhagarg CambridgeMember

    Thanks a lot for your advice. Please could you tell me the tool name for indexing fasta file in picard?

  • Will_GilksWill_Gilks University of Sussex, UKMember


    I use this code for making the various reference genome helper-files. Ideally the helper-files would only be made once by the same group that assembled the genome. This would prevent errors caused by people using different methods, and prevent someone having to spend time making their own files.

    ## Define variables (placeholder values; set these for your own system)
    full_path=/path/to/reference/    # directory holding the fasta, with trailing slash
    my_fasta=my_genome.fa            # fasta file name
    ## Make index with BWA
    module load bio/1.15
    bwa index -a bwtsw "${my_fasta}"
    module unload bio/1.15
    ## Create index with SAMtools
    module load samtools/1.0
    samtools faidx "${full_path}${my_fasta}"
    module unload samtools/1.0
    ## Build Genome and Hash files with Stampy (assumes the module provides stampy.py)
    module load stampy/1.0.23
    stampy.py -G "${my_fasta}" "${my_fasta}"
    stampy.py -g "${my_fasta}" -H "${my_fasta}"
    module unload stampy/1.0.23
    ## Create sequence dictionary with Picard tools. Note, this assumes fasta file suffix is .fa
    module load picard-tools/1.77
    CreateSequenceDictionary \
        R="${full_path}${my_fasta}" \
        O="${full_path}${my_fasta%.fa}.dict"
    module unload picard-tools/1.77
  • sumedhagargsumedhagarg CambridgeMember

    Thanks a lot for responding. I have a different issue now. Would you be able to help with that please? I have posted it at :
