Errors about contigs in BAM or VCF files not being properly sorted

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,737Administrator, GATK Dev admin
edited May 26 in Common Problems

This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order.

So what do you do?

For BAM files

You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:

java -jar picard.jar ReorderSam I=original.bam R=reference.fasta O=reordered.bam

Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file.

For VCF files

You run this handy little script on the VCF to re-sort it to match the reference.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

sub usage {

    print "\nUsage:\n";
    print " [--k POS] INPUT REF_DICT\n\n";

    print " Sorts lines of the input file INFILE according\n";
    print " to the reference contig order specified by the\n";
    print " reference dictionary REF_DICT (.fai file).\n";
    print " The sort is stable. If -k option is not specified,\n";
    print " it is assumed that the contig name is the first\n";
    print " field in each line.\n\n";
    print "  INPUT      input file to sort. If '-' is specified, \n";
    print "             then reads from STDIN.\n";
    print "  REF_DICT   .fai file, or ANY file that has contigs, in the\n";
    print "             desired sorting order, as its first column.\n";
    print "  --k POS :  contig name is in the field POS (1-based)\n";
    print "             of input lines.\n\n";


my $pos = 1;
GetOptions( "k:i" => \$pos );


usage() if ( scalar(@ARGV) == 0 );

if ( scalar(@ARGV) != 2 ) {
    print "Wrong number of arguments\n";

my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];

open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");

my %ref_order;

my $n = 0;
while ( <DICT> ) {
    my ($contig, $rest) = split "\t";
    die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );

    $ref_order{$contig} = $n;

close DICT;
#we have loaded contig ordering now

my $INPUT;
if ($input_file eq "-" ) {
    $INPUT = "STDIN";
} else {
    open($INPUT, "< $input_file") or die("Can not open $input_file: $!");

my %temp_outputs;

while ( <$INPUT> ) {

    my @fields = split '\s';
    die("Specified field position exceeds the number of fields:\n$_") 
        if ( $pos >= scalar(@fields) );

    my $contig = $fields[$pos];
    if ( $contig =~ m/:/ ) {
        my @loc = split(/:/, $contig);
        # print $contig . " " . $loc[0] . "\n";
        $contig = $loc[0]
    chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line

    my $order;
    if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
    else {
        $order = $n; # input line has contig that was not in the dict; 
        $n++; # this contig will go at the end of the output, 
              # after all known contigs

    my $fhandle;
    if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
    else {
        #print "opening $order $$ $_\n";
        open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
            die ( "Can not open temporary file $order: $!");
        $temp_outputs{$order} = $fhandle;

    # we got the handle to the temp file that keeps all 
    # lines with contig $contig

    print $fhandle $_; # send current line to its corresponding temp file

close $INPUT;

foreach my $f ( values %temp_outputs ) { close $f; }

# now collect back into single output stream:

for ( my $i = 0 ; $i < $n ; $i++ ) {
    # if we did not have any lines on contig $i, then there's 
    # no temp file and nothing to do
    next if ( ! defined $temp_outputs{$i} ) ; 

    my $f; 
    open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
    while ( <$f> ) { print ; }
    close $f;

    unlink "/tmp/sortByRef.$$.$i.tmp";

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD


  • fjrossellofjrossello Posts: 12Member

    Hi Geraldine,

    I want to create a sorted RefSeq file and I followed the following process:

    cat refSeq_hg19.txt | sort -n -k5,5 -T /home/fernandr/ | perl /usr/local/biotools/GenomeAnalysisTK-2.2-13/ --k 3 - /home/fernandr/biotools/references/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai > refSeq_hg19.igenomes.sorted.txt

    However, I get the following error:

    Can not open temporary file 1027: Too many open files at /usr/local/biotools/GenomeAnalysisTK-2.2-13/ line 95, line 2355.

    Any ideas of what I may be doing wrong?

    Thanks in advance.



  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,737Administrator, GATK Dev admin

    Looks like a limitation of your filesystem. Are you working with draft genomes that contain many contigs?

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 225Member ✭✭✭
    edited January 2013

    Yeah, I agree with Geraldine. looks like your /etc/security/limits.conf for your user is set to 1024 (usually that's the default I think). If that refSeq_hg19.txt is refseq from UCSC it will probably contain about 40,000 rows. I changed my soft and hard cap to 60,000 in order to perform the sorting. e.g.

    me        soft    nofile  60000
    me        hard    nofile  60000

    you'll probably need your sysadmin to do it for you (the file is only modify-able by root/admin).

    Post edited by Geraldine_VdAuwera on
  • KurtKurt Posts: 225Member ✭✭✭

    sorry, the "me soft nofile 60000" and "me hard nofile 60000" are supposed to be 2 separate lines. I'll never get used to this markup or markdown or whatever it is.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,737Administrator, GATK Dev admin

    Fixed that for you, I hope you don't mind. You can always hit the "C" icon in the top left corner of the comment box to invoke basic code formatting.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 225Member ✭✭✭

    Ahhhh, oh button icons. Thanks!

  • oliviajmoliviajm Posts: 2Member

    Hi everybody,

    I had some troubles to get a refSeq file with the right format for GATK mainly because of the order of the lines.
    The easiest solution (and the only one which worked for my files) is to get the refSeq file from the UCSC site and then to open it with OpenOffice to sort it by chr and start position. You need to choose "text" as the column type for the strand and check the "natural order" in the option tab of the sort window and that's OK.
    I tried the sort command on the 3rd and 5th columns and the perl script many times but never ended to the right format.
    The OpenOffice solution is really more efficient and quick in my case.
    Hope it can be useful to other users.


  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,737Administrator, GATK Dev admin

    Thanks for sharing your solution, Olivia.

    Geraldine Van der Auwera, PhD

  • meharmehar Posts: 87Member

    Hi all,

    I have a similar issue with number for number of temporary files open while using the script

    The ulimit is set to a maximum of 65K which cannot be raised further and the script complains:

    Can not open temporary file 7544: Too many open files at ./ line 95, <$INPUT> line 2045.>

    Could anyone give some tips to overcome this? Could any modification to the script help?

    Any help is valuable.

  • alejandraalejandra spainPosts: 7Member

    Hi. I had exactly the same problem with

    "Can not open temporary file 1475: Too many open files at line 93, <$INPUT> line 1021."

    Did you find the solution?

  • alejandraalejandra spainPosts: 7Member

    I didn't completely fix it but here is what I did and I got the BaseCalibrator to work:
    sort the bam file with ReorderSam (Picard)
    index the new ordered file with samtools

    download the file from and run:
    perl reference.dict dbSNP.vcf > .dbSnp_vcfSorter.vcf

Sign In or Register to comment.