GATK bundle fai files do not work with samtools
Several Strelka user issues reported to me recently were traced back to non-standard fai files in the GATK bundle. The problem appears to be that some GATK bundle fai files contain spaces in the first column, which holds the contig name. For instance (with tabs written as \t for clarity):
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1\t249250621\t52\t60\t61
In this case the first-column value "1 dna:chromosome chromosome:GRCh37:1:1:249250621:1" causes samtools (0.1.18) to crash when it reads this file. The same text would be accepted as a header line in a fasta file, but not in the first column of a fasta index. It also causes tools such as Strelka, which rely heavily on libbam, to fail. Taking the NCBI v37 fasta files supplied in the GATK bundle and running "samtools faidx human_g1k_v37_decoy.fasta" produces:
$ head -1 human_g1k_v37_decoy.fasta.fai
The problematic fai files appear to be associated with all of the NCBI v37 fastas:
Can the fai files be corrected and the bundles updated? I'd greatly appreciate it, since this would prevent problems for anyone using samtools-based software.
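Until the bundle is updated, anyone hitting this can check whether their fai file is affected. Below is a minimal Python sketch, not part of samtools or GATK: the function names are hypothetical, and the repair step assumes the contig name should be the header text before the first whitespace. Regenerating the index with "samtools faidx" as described above is probably the safer fix; this only illustrates what a well-formed first column looks like.

```python
def find_bad_fai_names(lines):
    """Return (line_number, name) pairs whose first column contains whitespace."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # The first tab-separated column of a fai line is the contig name.
        name = line.rstrip("\n").split("\t", 1)[0]
        if any(ch.isspace() for ch in name):
            bad.append((number, name))
    return bad


def truncate_fai_names(lines):
    """Return fai lines with each contig name cut at its first whitespace.

    Assumption: the intended name is the text before the first whitespace.
    """
    fixed = []
    for line in lines:
        name, sep, rest = line.rstrip("\n").partition("\t")
        parts = name.split(None, 1)
        short = parts[0] if parts else name  # leave empty names untouched
        fixed.append(short + sep + rest)
    return fixed


# Example using the line quoted in this post.
example = [
    "1 dna:chromosome chromosome:GRCh37:1:1:249250621:1\t249250621\t52\t60\t61\n"
]
print(find_bad_fai_names(example))
print(truncate_fai_names(example))
```

To run this against a real file, pass the result of open("human_g1k_v37_decoy.fasta.fai").readlines() instead of the example list.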