The File Decoded
One way to do it is this: first read in the eight bytes with no
decoding, using the a8 packing code. That gives us a string of eight
bytes. Reverse the bytes (with reverse), and decode the reversed
string using the d packing code. Something like this:
    while (read ORF_DATA, $buffer, $record_length) {
        my ($orf_name, $orf_left, $orf_right, $orf_direction, $orf_descr)
            = unpack("x5x4A11x4x4x1a8x1a8x4A1x4x14x4x5x1x8x4A50", $buffer);
        my $reverse_left = reverse($orf_left);
        $orf_left = unpack("d", $reverse_left);
        my $reverse_right = reverse($orf_right);
        $orf_right = unpack("d", $reverse_right);
        print "$orf_name, $orf_left, $orf_right, $orf_direction, $orf_descr\n";
    }
    close ORF_DATA;
And sure enough, we get:
all0001, -311, 918, c, unknown protein
all0002, 981, 1718, c, unknown protein
asl0003, 2617, 2805, c, unknown protein
arl5500, 2861, 3247, c, ssrA: 10Sa RNA
all0004, 3418, 4365, c, AtpC: ATP synthase subunit gamma
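The byte-reversal step can be tried in isolation. The following is a minimal sketch, not part of the original program: it fakes a "foreign" double by packing a number natively and reversing its eight bytes, then decodes it the same way the loop above does. The variable names $foreign and $decoded are made up for illustration.

```perl
use strict;
use warnings;

# Simulate a double stored in the byte order opposite to this machine's:
# pack the number natively, then reverse the eight bytes.
my $foreign = scalar reverse(pack("d", 918));

# Reversing the bytes again before unpacking recovers the original value.
my $decoded = unpack("d", scalar reverse($foreign));

print "decoded: $decoded\n";   # prints "decoded: 918"
```

Perl 5.10 and later also accept byte-order modifiers on packing codes, such as d< and d>, which may make the explicit reverse unnecessary when the file's byte order is known; the program in this section does not rely on them.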
A more succinct way of writing
    my $reverse_left = reverse($orf_left);
    $orf_left = unpack("d", $reverse_left);
    my $reverse_right = reverse($orf_right);
    $orf_right = unpack("d", $reverse_right);
is the following:
    foreach my $coordinate ($orf_left, $orf_right) {
        $coordinate = unpack("d", reverse($coordinate));
    }
Inside the loop, $coordinate is an alias for each variable in turn, so
the assignment updates $orf_left and $orf_right themselves.
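The in-place update performed by the loop can be checked on its own. This is a small sketch under the same assumption as before (doubles whose bytes are in the opposite order to the machine's); $left and $right stand in for $orf_left and $orf_right and are not taken from the real file.

```perl
use strict;
use warnings;

# Two stand-in coordinates, each stored as eight byte-reversed bytes.
my $left  = scalar reverse(pack("d", 981));
my $right = scalar reverse(pack("d", 1718));

# foreach aliases $coordinate to $left and then to $right, so assigning
# to $coordinate overwrites the original variables.
foreach my $coordinate ($left, $right) {
    $coordinate = unpack("d", reverse($coordinate));
}

print "$left, $right\n";   # prints "981, 1718"
```

Note that reverse is in scalar context here because unpack takes its arguments as scalars, so it reverses the characters of the string rather than a list.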
Here's the complete program to display the fields we're interested in:
    #!/usr/bin/perl -w
    use strict;
    dump_data("7120DB.DAT");
    sub dump_data {
        my ($orf_file) = @_;
        my $record_length = 230;
        open ORF_DATA, "<$orf_file" or die "Can't open $orf_file: $!\n";
        binmode ORF_DATA; # Tell Perl this isn't a text file
        my $buffer;
        while (read ORF_DATA, $buffer, $record_length) {
            my ($orf_name, $orf_left, $orf_right, $orf_direction, $orf_descr)
                = unpack("x5x4A11x4x4x1a8x1a8x4A1x4x14x4x5x1x8x4A50", $buffer);
            foreach my $coordinate ($orf_left, $orf_right) {
                $coordinate = unpack("d", reverse($coordinate));
            }
            print "$orf_name, $orf_left, $orf_right, $orf_direction, $orf_descr\n";
        }
        close ORF_DATA;
    }
If we run it, at the very end we'll see the following puzzling line:
x outside of string at print-db-7.pl line 13.
This is a symptom of a short record: the last read returned fewer than
230 bytes, and Perl has tried to skip, with the x packing code, past
the end of $buffer. That is, after reading some number of complete
230-byte records from 7120DB.DAT, there's something left over; the
length of the file is not evenly divisible by 230.
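A short final read is easy to reproduce in miniature. The sketch below uses a made-up in-memory "file" of two 10-byte records followed by three stray bytes (not the real 7120DB.DAT layout) and records how many bytes each call to read returns; the helper name chunk_lengths is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical miniature file: two 10-byte records plus 3 stray bytes.
my $record_length = 10;
my $data = ("A" x 10) . ("B" x 10) . "CCC";

# Return the number of bytes handed back by each read() call.
sub chunk_lengths {
    my ($contents, $length) = @_;
    open my $fh, '<', \$contents or die "Can't open in-memory file: $!\n";
    binmode $fh;
    my (@lengths, $buffer);
    while (my $got = read($fh, $buffer, $length)) {
        push @lengths, $got;   # read() returns the byte count, 0 at end of file
    }
    close $fh;
    return @lengths;
}

my @lengths = chunk_lengths($data, $record_length);
print "chunk lengths: @lengths\n";   # prints "chunk lengths: 10 10 3"
```

The last chunk is shorter than a record, which is the situation in which an unpack template that skips further than the buffer's length fails.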
The length of the file is 1425775. A small Perl program, or some work
with a calculator, tells us that 230 times 6199 is 1425770; in
other words, there are five extra bytes in the file.
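The arithmetic can be checked with a few lines of Perl. This is one possible version of such a "small Perl program", with the file size written in as a literal; the subroutine name record_layout is made up for this sketch.

```perl
use strict;
use warnings;

# Split a file size into whole records plus leftover bytes.
sub record_layout {
    my ($file_size, $record_length) = @_;
    my $leftover = $file_size % $record_length;
    my $records  = ($file_size - $leftover) / $record_length;
    return ($records, $leftover);
}

# The size is hard-coded here; for a real file it could come from -s $filename.
my ($records, $leftover) = record_layout(1425775, 230);
print "$records full records, $leftover bytes left over\n";
# prints "6199 full records, 5 bytes left over"
```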
Hmmm... five extra bytes. We previously decided there was an
extra five-byte field at the beginning of each 230-byte record. Maybe
instead of five bytes at the beginning of each record there are five
extra bytes at the beginning of the entire file; five extra bytes and
no more.
We can add the line
    read ORF_DATA, $buffer, 5;
to fix the problem.
SQ8: Where should we add the line?