Our first task is to decode the file 7120db.dat, which contains binary data. (Binary, in
computer jargon, means data that the computer can read, but that isn't
in a format suitable for humans to look at.)
We're told that numeric data is preceded by a single byte (maybe the
letter "S") that flags it as a number, and that character data is
preceded by four bytes giving the length. We'll look at the file layout
a bit later. First a digression: What's a byte? And why do
we care?
A byte is a unit of computer storage, big enough to hold a letter like "S" or "a". These days computer memory is usually given in terms of megabytes (millions of bytes) for RAM, the part that goes away when you turn the computer off, and gigabytes (billions of bytes), for hard disks, which are permanent storage.
We'll
be looking at much smaller units: for instance, the amount of storage
computers use for a single number.
There
are confusingly many ways for computers to store numbers. They
can be held in a single byte (uncommon these days), in two bytes, in
four bytes, or in eight bytes. The more bytes, the greater the
range of numbers that can be stored. A single byte can store
integers from 0 to 255; two bytes can store integers from 0 to 65535,
and four bytes can store integers from 0 to a little over 4 billion (4,294,967,295).
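The ranges follow directly from the number of bits available: a value stored in k bytes has 8k bits, so the largest unsigned integer is 2^(8k) - 1. Here is a minimal sketch that prints those limits; the helper name max_unsigned is ours, chosen only for illustration.

```perl
use strict;
use warnings;

# Largest unsigned integer that fits in the given number of bytes:
# 2 ** (8 * bytes) - 1.
sub max_unsigned {
    my ($bytes) = @_;
    return 2 ** (8 * $bytes) - 1;
}

# Print the range for one-, two-, and four-byte integers.
for my $bytes (1, 2, 4) {
    printf "%d byte(s): 0 to %.0f\n", $bytes, max_unsigned($bytes);
}
```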
Another
numeric format is called floating
point. Unlike integers, floating point numbers can have
fractions. They also have a wider range. A four-byte
floating point number can be as large as about 10^38. Four-byte
floating point numbers aren't as precise as four-byte integers --
there's no exact representation for 2147483548, for
instance; the nearest four-byte floating point number is 2147483520.
Eight-byte
floating point numbers (often called double-precision)
are often used -- they can be as large as about 10^308, and they
can represent every four-byte integer exactly.
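We can check the precision claims by pushing a number through each format and back again. This is a sketch that assumes the machine uses the usual IEEE 754 formats for four-byte and eight-byte floating point, which is true of nearly every current platform; the variable names are ours.

```perl
use strict;
use warnings;

my $n = 2147483548;

# Store the number as a four-byte float ("f") and as an eight-byte
# double ("d"), then read each one back.
my ($as_float)  = unpack("f", pack("f", $n));
my ($as_double) = unpack("d", pack("d", $n));

printf "original:  %.0f\n", $n;
printf "as float:  %.0f\n", $as_float;    # rounded to the nearest float
printf "as double: %.0f\n", $as_double;   # survives unchanged
```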
Some languages have a different way to declare each different type of number. Perl doesn't -- it uses primarily eight-byte floating point numbers -- but sometimes we need to read files that other programs have written.
For
that Perl provides the unpack function: ($a, $b, $c) = unpack "nNd", $buffer,
for instance, will look at
the (otherwise unprintable) information in $buffer and decode it.
The letters "n", "N", and "d" are packing codes that tell Perl how
to decode $buffer, according to the following table:
Code | Type of number |
C | A one-byte integer (fits in one Character) |
n | A two-byte integer |
N | A four-byte iNteger |
f | A four-byte floating point number |
d | An eight-byte (double-precision) floating point number |
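To see these codes at work without a data file, we can build a small buffer with pack (the inverse of unpack) and then decode it. The values and variable names below are made up for illustration.

```perl
use strict;
use warnings;

# Build a sample buffer: a two-byte integer, a four-byte integer,
# and an eight-byte floating point number, 14 bytes in all.
my $buffer = pack("nNd", 2354, 100000, 3.5);
printf "buffer is %d bytes long\n", length $buffer;

# Decode it again with the same template.
my ($short, $long, $double) = unpack("nNd", $buffer);
print "short=$short long=$long double=$double\n";
```

Note that the template has to describe the bytes exactly: the same 14 bytes read with a different template would give different, and probably meaningless, numbers.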
Just
as DNA can be read forward or in reverse, so can numbers. The
n and N codes are for what are called
big-endian numbers, where the most significant parts of the number
are on the left ("big end first"). This is the way we write
numbers by hand: in the number 2354, there are two thousands, three
hundreds, five tens, and four ones. If we were little-endian we'd write
that number in the reverse order: 4532 for four ones, five tens, three
hundreds, and two thousands.
Code | Type of number |
v | A two-byte integer in reverse order. |
V | A four-byte integer in reVerse order. |
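In storage it is the bytes, not the decimal digits, that change order. A short sketch makes this visible by packing the same number both ways and printing the bytes in hexadecimal (2354 is 0932 in hexadecimal); the variable names are ours.

```perl
use strict;
use warnings;

my $big    = pack("n", 2354);   # big-endian two-byte integer
my $little = pack("v", 2354);   # little-endian two-byte integer

# "H*" shows the bytes as hexadecimal digits, in storage order.
my ($big_hex)    = unpack("H*", $big);
my ($little_hex) = unpack("H*", $little);
print "big-endian bytes:    $big_hex\n";      # 0932
print "little-endian bytes: $little_hex\n";   # 3209

# Reading big-endian bytes with the little-endian code gives
# a different number altogether.
my ($wrong) = unpack("v", $big);
print "misread value: $wrong\n";
```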
There
are several non-numeric packing codes that we'll also be using:
Code | Action |
x | Skip one byte |
x2 | Skip two bytes (and similarly for x17 or another number) |
a | A single-character string |
a2 | A two-character string (and similarly for a17, etc.) |
A7 | A seven-character string. Trim any blanks from the end |
A | A single-character string, or the zero-length string. |
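These codes are easy to try on a short string. The sample below is a made-up seven-character record, "ab" padded with three blanks and followed by "XY"; the variable names are ours.

```perl
use strict;
use warnings;

my $record = "ab   XY";   # "ab", three blanks, then "XY"

my ($keep)    = unpack("a5", $record);      # five characters, blanks kept
my ($trim)    = unpack("A5", $record);      # five characters, blanks trimmed
my ($skipped) = unpack("x5 a2", $record);   # skip five bytes, read two
my ($empty)   = unpack("A", " ");           # a lone blank trims to nothing

print "a5    gives [$keep]\n";
print "A5    gives [$trim]\n";
print "x5 a2 gives [$skipped]\n";
print "A     gives [$empty]\n";
```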
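The difference between "a7" and "A7" is that
my ($first, $last) = unpack("a7a10", "Pat    Pending   ");
will set $first to "Pat    " (with four blanks at the end) and $last to "Pending   " (with three blanks at the end). On the other hand,
my ($first, $last) = unpack("A7A10", "Pat    Pending   ");
will set $first to "Pat" and $last to "Pending", with the trailing blanks trimmed.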
Field | Bytes | Length | Type |
OrfName | 4 + 11 | 15 | Char |
OrfContig | 4 + 5 | 9 | Char |
OrfLeft | 1 + 8 | 9 | Num |
OrfRight | 1 + 8 | 9 | Num |
OrfDirection | 4 + 1 | 5 | Char |
OrfAccession | 4 + 14 | 18 | Char |
OrfPct | 4 + 5 | 9 | Char |
OrfEval | 1 + 8 | 9 | Num |
OrfDescr | 4 + 50 | 54 | Char |
TOTAL | | 230 | |
my ($name_length, $orf_name, $orf_left, $orf_right, $orf_direction, $descr_length, $orf_descr)
    = unpack(..., $buffer);
($name_length and $descr_length are the lengths given in 7120db.dat for $orf_name and $orf_descr, respectively. We don't read in the length of $orf_direction, since that is supposed to be always exactly one character long.)
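To get a feel for how such a template is put together, here is a sketch that decodes only the first character field and the first numeric field of a made-up record. It is not the template for 7120db.dat. It rests on two assumptions that the file description does not confirm: that the four length bytes are a big-endian integer ("N"), and that the eight numeric bytes are a native double ("d"). The sample name and number are invented.

```perl
use strict;
use warnings;

# ASSUMPTIONS, not facts about 7120db.dat: length bytes are a big-endian
# four-byte integer ("N"); numbers are eight-byte native doubles ("d").
# The sample record is fabricated with pack for illustration only.
my $buffer = pack("N A11 a d", 11, "all0001", "S", 1234);

# N   = four-byte length of the name
# A11 = the eleven-character name, trailing blanks trimmed
# x   = skip the one-byte flag in front of the number
# d   = the eight-byte number
my ($name_length, $orf_name, $orf_left) = unpack("N A11 x d", $buffer);

print "name_length=$name_length orf_name=$orf_name orf_left=$orf_left\n";
```

If either assumption is wrong for the real file, the lengths or numbers will come out as nonsense, which is a useful signal that "V" or a different numeric code is needed instead.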