It looks like you're new here. If you want to get involved, click one of these buttons!
Geraldine_VdAuwera
Posts: 2,239Administrator, GSA Official Member admin
This script can be used for sorting an input file based on a reference.
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
sub usage {
print "\nUsage:\n";
print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";
print " Sorts lines of the input file INFILE according\n";
print " to the reference contig order specified by the\n";
print " reference dictionary REF_DICT (.fai file).\n";
print " The sort is stable. If -k option is not specified,\n";
print " it is assumed that the contig name is the first\n";
print " field in each line.\n\n";
print " INPUT input file to sort. If '-' is specified, \n";
print " then reads from STDIN.\n";
print " REF_DICT .fai file, or ANY file that has contigs, in the\n";
print " desired soting order, as its first column.\n";
print " --k POS : contig name is in the field POS (1-based)\n";
print " of input lines.\n\n";
exit(1);
}
my $pos = 1;
GetOptions( "k:i" => \$pos );
$pos--;
usage() if ( scalar(@ARGV) == 0 );
if ( scalar(@ARGV) != 2 ) {
print "Wrong number of arguments\n";
usage();
}
my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];
open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");
my %ref_order;
my $n = 0;
while ( <DICT> ) {
chomp;
my ($contig, $rest) = split "\t";
die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );
$ref_order{$contig} = $n;
$n++;
}
close DICT;
#we have loaded contig ordering now
my $INPUT;
if ($input_file eq "-" ) {
$INPUT = "STDIN";
} else {
open($INPUT, "< $input_file") or die("Can not open $input_file: $!");
}
my %temp_outputs;
while ( <$INPUT> ) {
my @fields = split '\s';
die("Specified field position exceeds the number of fields:\n$_")
if ( $pos >= scalar(@fields) );
my $contig = $fields[$pos];
if ( $contig =~ m/:/ ) {
my @loc = split(/:/, $contig);
# print $contig . " " . $loc[0] . "\n";
$contig = $loc[0]
}
chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line
my $order;
if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
else {
$order = $n; # input line has contig that was not in the dict;
$n++; # this contig will go at the end of the output,
# after all known contigs
}
my $fhandle;
if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
else {
#print "opening $order $$ $_\n";
open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
die ( "Can not open temporary file $order: $!");
$temp_outputs{$order} = $fhandle;
}
# we got the handle to the temp file that keeps all
# lines with contig $contig
print $fhandle $_; # send current line to its corresponding temp file
}
close $INPUT;
foreach my $f ( values %temp_outputs ) { close $f; }
# now collect back into single output stream:
for ( my $i = 0 ; $i < $n ; $i++ ) {
# if we did not have any lines on contig $i, then there's
# no temp file and nothing to do
next if ( ! defined $temp_outputs{$i} ) ;
my $f;
open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
while ( <$f> ) { print ; }
close $f;
unlink "/tmp/sortByRef.$$.$i.tmp";
}
Geraldine Van der Auwera, PhD
Comments
Hi Geraldine,
I want to create a sorted RefSeq file and I followed the following process:
cat refSeq_hg19.txt | sort -n -k5,5 -T /home/fernandr/ | perl /usr/local/biotools/GenomeAnalysisTK-2.2-13/SortByRef.pl --k 3 - /home/fernandr/biotools/references/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai > refSeq_hg19.igenomes.sorted.txt
However, I get the following error:
Can not open temporary file 1027: Too many open files at /usr/local/biotools/GenomeAnalysisTK-2.2-13/SortByRef.pl line 95, line 2355.
Any ideas of what I may be doing wrong?
Thanks in advance.
Cheers,
Fernando
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Looks like a limitation of your filesystem. Are you working with draft genomes that contain many contigs?
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Yeah, I agree with Geraldine. looks like your /etc/security/limits.conf for your user is set to 1024 (usually that's the default I think). If that refSeq_hg19.txt is refseq from UCSC it will probably contain about 40,000 rows. I changed my soft and hard cap to 60,000 in order to perform the sorting. e.g.
you'll probably need your sysadmin to do it for you (the file is only modify-able by root/admin).
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •sorry, the "me soft nofile 60000" and "me hard nofile 60000" are supposed to be 2 separate lines. I'll never get used to this markup or markdown or whatever it is.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Fixed that for you, I hope you don't mind. You can always hit the "C" icon in the top left corner of the comment box to invoke basic
codeformatting.Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Ahhhh, oh button icons. Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •