Decoding A Binary File

Our first task is to decode the file 7120db.dat, which contains binary data. (Binary, in computer jargon, means data that the computer can read, but that isn't in a format suitable for humans to look at.)

We're told that numeric data is preceded by a single byte (maybe the letter "S") that flags it as a number, and that character data is preceded by four bytes giving the length. We'll look at the file layout a bit later. First a digression: What's a byte? And why do we care?

A byte is a unit of computer storage, big enough to hold a letter like "S" or "a". These days computer memory is usually given in terms of megabytes (millions of bytes) for RAM, the part that goes away when you turn the computer off, and gigabytes (billions of bytes), for hard disks, which are permanent storage.

We'll be looking at much smaller units: for instance, the amount of storage computers use for a single number.

There are confusingly many ways for computers to store numbers. They can be held in a single byte (uncommon these days), in two bytes, in four bytes, or in eight bytes. The more bytes, the greater the range of numbers that can be stored. A single byte can store integers from 0 to 255; two bytes can store integers from 0 to 65535, and four bytes can store integers from 0 to a little over 4 billion (4,294,967,295).
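These ranges follow from the fact that a byte holds 8 bits, so a number stored in a given count of bytes can take 2 to the power (8 times that count) different values. Here is a small sketch that works the limits out; the variable names are ours, chosen just for illustration.

```perl
use strict;
use warnings;

# Largest unsigned integer that fits in a given number of bytes:
# each byte contributes 8 bits, so the maximum is 2**(8 * bytes) - 1.
my %max_unsigned;
for my $bytes (1, 2, 4) {
    $max_unsigned{$bytes} = 2**(8 * $bytes) - 1;
    print "$bytes byte(s): 0 to $max_unsigned{$bytes}\n";
}
```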

Another numeric format is called floating point. Unlike integers, floating point numbers can have fractions. They also have a wider range: a four-byte floating point number can be as large as about 10³⁸. The price is precision. Four-byte floating point numbers aren't as precise as four-byte integers -- there's no exact representation for 2147483548, for instance; the nearest four-byte floating point number is 2147483520.
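We can watch this loss of precision happen. The sketch below pushes the integer through a four-byte float and back, using pack and unpack with the "f" code. It assumes the local machine uses the usual IEEE 754 single-precision format for "f", which is true of practically all current hardware.

```perl
use strict;
use warnings;

# Round-trip an integer through a four-byte float: pack "f" stores it in
# four bytes, and unpack "f" reads those four bytes back.
my $original = 2147483548;
my $as_float = unpack("f", pack("f", $original));

printf "%d stored as a four-byte float becomes %d\n", $original, $as_float;
```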

Eight-byte floating point numbers (usually called double-precision) are widely used -- they can be as large as about 10³⁰⁸, and they can represent every four-byte integer exactly.

Some languages have a different way to declare each different type of number. Perl doesn't -- it uses primarily eight-byte floating point numbers -- but sometimes we need to read files that other programs have written.

For that Perl provides the unpack function: ($a, $b, $c) = unpack "nNd", $buffer, for instance, will look at the (otherwise unprintable) information in $buffer and decode it. The letters "n", "N", and "d" are packing codes that tell Perl how to decode $buffer, according to the following table:

Code	Type of number
C	A one-byte integer (fits in one Character)
n	A two-byte integer
N	A four-byte iNteger
f	A four-byte floating point number
d	An eight-byte (double-precision) floating point number
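To try these codes out without a data file, we can build a small buffer with pack (the mirror image of unpack) and then decode it again. This is only a sketch: the three values are made up, and the variable names are ours.

```perl
use strict;
use warnings;

# Build a sample buffer with pack so that we have something to decode.
# The values 513, 70000 and 3.25 are made up for illustration.
my $buffer = pack("nNd", 513, 70000, 3.25);

# n = two bytes, N = four bytes, d = eight bytes: 14 bytes in all.
print "buffer length: ", length($buffer), "\n";

# Decode the same three numbers back out of the buffer.
my ($short, $long, $double) = unpack("nNd", $buffer);
print "$short $long $double\n";
```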

Just as DNA can be read forward or in reverse, so can numbers. The n and N codes are for what are called big-endian numbers, where the most significant parts of the number are on the left ("big end first"). This is the way we write numbers by hand: in the number 2354, there are two thousands, three hundreds, five tens, and four ones. If we were little-endian we'd write that number in the reverse order: 4532 for four ones, five tens, three hundreds, and two thousands.

Code	Type of number
v	A two-byte integer in reverse order.
V	A four-byte integer in reVerse order.
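A short sketch shows how much the choice of direction matters: the same bytes decode to quite different numbers depending on the code we pick. The byte values here are made up for illustration.

```perl
use strict;
use warnings;

# Two bytes with the values 1 and 2, in that order.
my $two_bytes = "\x01\x02";
my $big    = unpack("n", $two_bytes);   # big-endian:    1 * 256 + 2
my $little = unpack("v", $two_bytes);   # little-endian: 2 * 256 + 1
print "n reads $big, v reads $little\n";

# The same idea with four bytes: 0, 0, 1, 2.
my $four_bytes  = "\x00\x00\x01\x02";
my $big_long    = unpack("N", $four_bytes);
my $little_long = unpack("V", $four_bytes);
print "N reads $big_long, V reads $little_long\n";
```

If a decoded number looks absurdly large or small, trying the other direction is a reasonable first step.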

It used to be that most computers used the big-endian order (N and n) rather than little-endian (V and v). These days, however, most personal computers (those with Intel processors) are little-endian (V and v). Older Macintosh computers with PowerPC processors were big-endian.

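There are also two different directions for floating point numbers. Perl is a little less help here: the plain f and d codes only read such numbers in the format our local computer uses. (Perl 5.10 and later also accept the < and > modifiers, as in "f<" or "d>", to force little-endian or big-endian order.) So if we get a four-byte (f) or eight-byte (d) floating-point number that's in the wrong direction for our hardware, and we are using the plain codes, we may have to reverse it by hand.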
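Reversing by hand is less work than it sounds, because Perl's reverse function, used in scalar context, reverses the bytes of a string. The sketch below fakes a "wrong direction" number by reversing a native one, then recovers it; the value 1.5 is made up for illustration.

```perl
use strict;
use warnings;

# Pretend these eight bytes came from a machine with the opposite byte
# order: take a native double and reverse its bytes.
my $native  = pack("d", 1.5);
my $foreign = reverse $native;    # scalar context: reverses the string

# To decode, reverse the bytes back into native order, then unpack.
my $value = unpack("d", scalar reverse($foreign));
print "$value\n";
```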

There are several non-numeric packing codes that we'll also be using:

Code	Action
x	Skip one byte
x2	Skip two bytes (and similarly for x17 or another number)
a	A single-character string
a2	A two-character string (and similarly for a17, etc.)
A7	A seven-character string, with any blanks trimmed from the end
A	A single-character string, or the zero-length string if that character is a blank
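Here is a sketch of the skip code working together with the string codes. The buffer is made up: two bytes we don't care about, a seven-character name padded with blanks, and a one-character flag.

```perl
use strict;
use warnings;

# A made-up ten-byte buffer: two junk bytes, "orf7" padded to seven
# characters with blanks, then the single character "d".
my $buffer = "\x00\x00" . "orf7   " . "d";

# x2 skips the junk, A7 reads the name and trims its blanks,
# a reads the final character as it stands.
my ($name, $flag) = unpack("x2A7a", $buffer);
print "[$name] [$flag]\n";
```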

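The difference between "a7" and "A7" is that "a" keeps trailing blanks while "A" trims them. The statement

my ($first, $last) = unpack("a7a10", "Pat    Pending   ");

will set $first to "Pat    " (with four blanks at the end) and $last to "Pending   " (with three blanks at the end). On the other hand,

my ($first, $last) = unpack("A7A10", "Pat    Pending   ");

will set $first to "Pat" and $last to "Pending", with the trailing blanks trimmed.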

Description of 7120db.dat

Field	Bytes (prefix + data)	Length	Type
OrfName	4 + 11	15	Char
OrfContig	4 + 5	9	Char
OrfLeft	1 + 8	9	Num
OrfRight	1 + 8	9	Num
OrfDirection	4 + 1	5	Char
OrfAccession	4 + 14	18	Char
OrfPct	4 + 5	9	Char
OrfEval	1 + 8	9	Num
OrfDescr	4 + 50	54	Char
TOTAL		230

There are several fields after OrfDescr, which make up the rest of the 230 bytes, but which we'll otherwise ignore.
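It can help to know where each field starts. The sketch below adds up the lengths from the table to get a starting byte for every listed field. It assumes the fields sit back to back in the order shown, with no padding between them, which the description suggests but does not state outright.

```perl
use strict;
use warnings;

# Field names and total lengths (prefix + data), copied from the table.
my @fields = (
    [ OrfName      => 15 ],
    [ OrfContig    =>  9 ],
    [ OrfLeft      =>  9 ],
    [ OrfRight     =>  9 ],
    [ OrfDirection =>  5 ],
    [ OrfAccession => 18 ],
    [ OrfPct       =>  9 ],
    [ OrfEval      =>  9 ],
    [ OrfDescr     => 54 ],
);

# Running total: each field starts where the previous one ended.
my %offset;
my $position = 0;
for my $field (@fields) {
    my ($name, $length) = @$field;
    $offset{$name} = $position;
    printf "%-13s starts at byte %3d\n", $name, $position;
    $position += $length;
}
print "Listed fields end at byte $position of 230\n";
```

The gaps between the fields we want tell us how many bytes a packing string would have to skip with the x code.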

There are two kinds of data: numbers and character (string) data. We're told that numeric data is preceded by a single byte (maybe the letter "S") that flags it as a number, and that character data is preceded by four bytes giving the length.

We're interested in just five of these fields: OrfName (the short name of the ORF), OrfLeft and OrfRight (the co-ordinates of the left and right ends of the ORF), OrfDirection ("d" for direct and "c" for reverse complement), and OrfDescr, information about the ORF intended for humans. We might or might not need the lengths of the character fields; we'll try reading them in to see what information they give.

SQ1: We might want to read the information just mentioned using a statement like the following:

   my ($name_length, $orf_name, $orf_left, $orf_right, $orf_direction, $descr_length, $orf_descr)
       = unpack(..., $buffer);

($name_length and $descr_length are the lengths given in 7120db.dat for $orf_name and $orf_descr respectively. We don't read in the length of $orf_direction, since that is supposed to be always exactly one character long.)
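Where does $buffer come from? One possibility is sketched below. It assumes that the file is nothing but 230-byte records, one after another, with no header in front, which we haven't confirmed. The function name read_records is ours, and the demonstration reads made-up data held in memory rather than the real file.

```perl
use strict;
use warnings;

my $RECORD_SIZE = 230;    # total length from the file description

# Read fixed-size records from a filehandle and return them as a list.
sub read_records {
    my ($fh) = @_;
    binmode $fh;                        # binary data: no newline translation
    my @records;
    while (1) {
        my $got = read($fh, my $buffer, $RECORD_SIZE);
        die "read failed: $!" unless defined $got;
        last if $got < $RECORD_SIZE;    # end of file (or a short tail)
        push @records, $buffer;
    }
    return @records;
}

# Demonstration on made-up data: two fake records of 230 bytes each.
my $fake_data = ("A" x 230) . ("B" x 230);
open my $fh, '<', \$fake_data or die "cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;
print scalar(@records), " records read\n";
```

For the real file we would open 7120db.dat instead of the in-memory string, and each element of @records would then serve as a $buffer for unpack.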

We need to replace the ... with a string of packing codes that tell Perl how to decode the information in $buffer. Since we don't have information on the direction (big-endian or little-endian) of the numbers, we'll have to pick one or the other and try it, planning to switch to the other direction if necessary. For the following questions, use the numeric packing codes that agree with the direction that most personal computers use.

SQ1a: What are good candidate packing strings for $name_length and $descr_length?

SQ1b: What are good candidate packing strings for $orf_name, $orf_direction, and $orf_descr?

SQ1c: What are good candidate packing strings for $orf_left and $orf_right?

SQ1d: What packing string should we use to replace ... ?

See the next page for answers.