The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Script for sorting an input file based on a reference (SortByRef.pl)

edited October 2012

This script can be used for sorting an input file based on a reference.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

sub usage {

print "\nUsage:\n";
print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";

print " Sorts lines of the input file INFILE according\n";
print " to the reference contig order specified by the\n";
print " reference dictionary REF_DICT (.fai file).\n";
print " The sort is stable. If -k option is not specified,\n";
print " it is assumed that the contig name is the first\n";
print " field in each line.\n\n";
print "  INPUT      input file to sort. If '-' is specified, \n";
print "             then reads from STDIN.\n";
print "  REF_DICT   .fai file, or ANY file that has contigs, in the\n";
print "             desired soting order, as its first column.\n";
print "  --k POS :  contig name is in the field POS (1-based)\n";
print "             of input lines.\n\n";

exit(1);
}

my $pos = 1; GetOptions( "k:i" => \$pos );

$pos--; usage() if ( scalar(@ARGV) == 0 ); if ( scalar(@ARGV) != 2 ) { print "Wrong number of arguments\n"; usage(); } my$input_file = $ARGV[0]; my$dict_file = $ARGV[1]; open(DICT, "<$dict_file") or die("Can not open $dict_file:$!");

my %ref_order;

my $n = 0; while ( <DICT> ) { chomp; my ($contig, $rest) = split "\t"; die("Dictionary file is probably corrupt: multiple instances of contig$contig") if ( defined $ref_order{$contig} );

$ref_order{$contig} = $n;$n++;
}

close DICT;
#we have loaded contig ordering now

my $INPUT; if ($input_file eq "-" ) {
$INPUT = "STDIN"; } else { open($INPUT, "< $input_file") or die("Can not open$input_file: $!"); } my %temp_outputs; while ( <$INPUT> ) {

my @fields = split '\s';
die("Specified field position exceeds the number of fields:\n$_") if ($pos >= scalar(@fields) );

my $contig =$fields[$pos]; if ($contig =~ m/:/ ) {
my @loc = split(/:/, $contig); # print$contig . " " . $loc[0] . "\n";$contig = $loc[0] } chomp$contig if ( $pos == scalar(@fields) - 1 ); # if last field in line my$order;
if ( defined $ref_order{$contig} ) { $order =$ref_order{$contig}; } else {$order = $n; # input line has contig that was not in the dict;$n++; # this contig will go at the end of the output,
# after all known contigs
}

my $fhandle; if ( defined$temp_outputs{$order} ) {$fhandle = $temp_outputs{$order} }
else {
#print "opening $order $$_\n"; open( fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
die ( "Can not open temporary file $order:$!");
$temp_outputs{$order} = $fhandle; } # we got the handle to the temp file that keeps all # lines with contig$contig

print $fhandle$_; # send current line to its corresponding temp file
}

close $INPUT; foreach my$f ( values %temp_outputs ) { close $f; } # now collect back into single output stream: for ( my$i = 0 ; $i <$n ; $i++ ) { # if we did not have any lines on contig$i, then there's
# no temp file and nothing to do
next if ( ! defined $temp_outputs{$i} ) ;

my $f; open ($f, "< /tmp/sortByRef.$$.i.tmp" ); while ( <f> ) { print ; } close f; unlink "/tmp/sortByRef.$$.\$i.tmp";

}

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 12Member

Hi Geraldine,

I want to create a sorted RefSeq file and I followed the following process:

cat refSeq_hg19.txt | sort -n -k5,5 -T /home/fernandr/ | perl /usr/local/biotools/GenomeAnalysisTK-2.2-13/SortByRef.pl --k 3 - /home/fernandr/biotools/references/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai > refSeq_hg19.igenomes.sorted.txt

However, I get the following error:

Can not open temporary file 1027: Too many open files at /usr/local/biotools/GenomeAnalysisTK-2.2-13/SortByRef.pl line 95, line 2355.

Any ideas of what I may be doing wrong?

Cheers,

Fernando

Looks like a limitation of your filesystem. Are you working with draft genomes that contain many contigs?

Geraldine Van der Auwera, PhD

• Posts: 166Member ✭✭✭
edited January 2013

Yeah, I agree with Geraldine. looks like your /etc/security/limits.conf for your user is set to 1024 (usually that's the default I think). If that refSeq_hg19.txt is refseq from UCSC it will probably contain about 40,000 rows. I changed my soft and hard cap to 60,000 in order to perform the sorting. e.g.

me        soft    nofile  60000
me        hard    nofile  60000


you'll probably need your sysadmin to do it for you (the file is only modify-able by root/admin).

Post edited by Geraldine_VdAuwera on
• Posts: 166Member ✭✭✭

sorry, the "me soft nofile 60000" and "me hard nofile 60000" are supposed to be 2 separate lines. I'll never get used to this markup or markdown or whatever it is.

Fixed that for you, I hope you don't mind. You can always hit the "C" icon in the top left corner of the comment box to invoke basic code formatting.

Geraldine Van der Auwera, PhD

• Posts: 166Member ✭✭✭

Ahhhh, oh button icons. Thanks!

• Posts: 2Member

Hi everybody,

I had some troubles to get a refSeq file with the right format for GATK mainly because of the order of the lines. The easiest solution (and the only one which worked for my files) is to get the refSeq file from the UCSC site and then to open it with OpenOffice to sort it by chr and start position. You need to choose "text" as the column type for the strand and check the "natural order" in the option tab of the sort window and that's OK. I tried the sort command on the 3rd and 5th columns and the perl script many times but never ended to the right format. The OpenOffice solution is really more efficient and quick in my case. Hope it can be useful to other users.

Olivia