Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Script for sorting an input file based on a reference (SortByRef.pl)

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin
edited October 2012 in Methods and Workflows

This script can be used for sorting an input file based on a reference.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

sub usage {

    print "\nUsage:\n";
    print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";

    print " Sorts lines of the input file INFILE according\n";
    print " to the reference contig order specified by the\n";
    print " reference dictionary REF_DICT (.fai file).\n";
    print " The sort is stable. If -k option is not specified,\n";
    print " it is assumed that the contig name is the first\n";
    print " field in each line.\n\n";
    print "  INPUT      input file to sort. If '-' is specified, \n";
    print "             then reads from STDIN.\n";
    print "  REF_DICT   .fai file, or ANY file that has contigs, in the\n";
    print "             desired soting order, as its first column.\n";
    print "  --k POS :  contig name is in the field POS (1-based)\n";
    print "             of input lines.\n\n";

    exit(1);
}

my $pos = 1;
GetOptions( "k:i" => \$pos );

$pos--;

usage() if ( scalar(@ARGV) == 0 );

if ( scalar(@ARGV) != 2 ) {
    print "Wrong number of arguments\n";
    usage();
}

my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];


open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");

my %ref_order;

my $n = 0;
while ( <DICT> ) {
    chomp;
    my ($contig, $rest) = split "\t";
    die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );

    $ref_order{$contig} = $n;
    $n++;
}

close DICT;
#we have loaded contig ordering now

my $INPUT;
if ($input_file eq "-" ) {
    $INPUT = "STDIN";
} else {
    open($INPUT, "< $input_file") or die("Can not open $input_file: $!");
}

my %temp_outputs;

while ( <$INPUT> ) {

    my @fields = split '\s';
    die("Specified field position exceeds the number of fields:\n$_") 
        if ( $pos >= scalar(@fields) );

    my $contig = $fields[$pos];
    if ( $contig =~ m/:/ ) {
        my @loc = split(/:/, $contig);
        # print $contig . " " . $loc[0] . "\n";
        $contig = $loc[0]
    }
    chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line

    my $order;
    if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
    else {
        $order = $n; # input line has contig that was not in the dict; 
        $n++; # this contig will go at the end of the output, 
              # after all known contigs
    }

    my $fhandle;
    if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
    else {
        #print "opening $order $$ $_\n";
        open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
            die ( "Can not open temporary file $order: $!");
        $temp_outputs{$order} = $fhandle;
    }

    # we got the handle to the temp file that keeps all 
    # lines with contig $contig

    print $fhandle $_; # send current line to its corresponding temp file
}

close $INPUT;

foreach my $f ( values %temp_outputs ) { close $f; }

# now collect back into single output stream:

for ( my $i = 0 ; $i < $n ; $i++ ) {
    # if we did not have any lines on contig $i, then there's 
    # no temp file and nothing to do
    next if ( ! defined $temp_outputs{$i} ) ; 

    my $f; 
    open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
    while ( <$f> ) { print ; }
    close $f;

    unlink "/tmp/sortByRef.$$.$i.tmp";


}
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • fjrossellofjrossello Posts: 12Member

    Hi Geraldine,

    I want to create a sorted RefSeq file and I followed the following process:

    cat refSeq_hg19.txt | sort -n -k5,5 -T /home/fernandr/ | perl /usr/local/biotools/GenomeAnalysisTK-2.2-13/SortByRef.pl --k 3 - /home/fernandr/biotools/references/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai > refSeq_hg19.igenomes.sorted.txt

    However, I get the following error:

    Can not open temporary file 1027: Too many open files at /usr/local/biotools/GenomeAnalysisTK-2.2-13/SortByRef.pl line 95, line 2355.

    Any ideas of what I may be doing wrong?

    Thanks in advance.

    Cheers,

    Fernando

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Looks like a limitation of your filesystem. Are you working with draft genomes that contain many contigs?

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 110Member ✭✭✭
    edited January 2013

    Yeah, I agree with Geraldine. looks like your /etc/security/limits.conf for your user is set to 1024 (usually that's the default I think). If that refSeq_hg19.txt is refseq from UCSC it will probably contain about 40,000 rows. I changed my soft and hard cap to 60,000 in order to perform the sorting. e.g.

    me        soft    nofile  60000
    me        hard    nofile  60000
    

    you'll probably need your sysadmin to do it for you (the file is only modify-able by root/admin).

    Post edited by Geraldine_VdAuwera on
  • KurtKurt Posts: 110Member ✭✭✭

    sorry, the "me soft nofile 60000" and "me hard nofile 60000" are supposed to be 2 separate lines. I'll never get used to this markup or markdown or whatever it is.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Fixed that for you, I hope you don't mind. You can always hit the "C" icon in the top left corner of the comment box to invoke basic code formatting.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 110Member ✭✭✭

    Ahhhh, oh button icons. Thanks!

  • oliviajmoliviajm Posts: 2Member

    Hi everybody,

    I had some troubles to get a refSeq file with the right format for GATK mainly because of the order of the lines. The easiest solution (and the only one which worked for my files) is to get the refSeq file from the UCSC site and then to open it with OpenOffice to sort it by chr and start position. You need to choose "text" as the column type for the strand and check the "natural order" in the option tab of the sort window and that's OK. I tried the sort command on the 3rd and 5th columns and the perl script many times but never ended to the right format. The OpenOffice solution is really more efficient and quick in my case. Hope it can be useful to other users.

    Olivia

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Thanks for sharing your solution, Olivia.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.