A First Attempt

Here are first answers to the study questions. The file description is repeated for reference:

Description of 7120db.dat

Field	Bytes	Length	Type
OrfName	4 + 11	15	Char
OrfContig	4 + 5	9	Char
OrfLeft	1 + 8	9	Num
OrfRight	1 + 8	9	Num
OrfDirection	4 + 1	5	Char
OrfAccession	4 + 14	18	Char
OrfPct	4 + 5	9	Char
OrfEval	1 + 8	9	Num
OrfDescr	4 + 50	54	Char
TOTAL		230

SQ1: For the following questions, use the numeric packing codes that agree with the direction that most personal computers use.

SQ1a: What are good candidate packing strings for $name_length and $descr_length?

Answer: These are the lengths of the strings, and the table tells us that each length is four bytes long. Since the length of $orf_descr, for instance, is presumably a whole number between 0 and 50, one of the integer packing codes should fit: n, N, v, or V. Only N and V read a four-byte number (n and v read two bytes). We don't yet know which is right for this file, but the question asked for the byte order most personal computers use, which is little-endian, so we choose V.
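To see the difference between the two four-byte codes, here is a minimal sketch that packs a length of 50 both ways and prints the resulting bytes. The value 50 and the helper name hex_bytes are only for illustration; they are not taken from the data file.

```perl
use strict;
use warnings;

# Show the raw bytes of a packed value as hexadecimal digits.
sub hex_bytes {
    my ($bytes) = @_;
    return unpack("H*", $bytes);
}

my $length = 50;    # illustrative value only

# V = 32-bit unsigned integer, little-endian (the order most PCs use).
# N = 32-bit unsigned integer, big-endian ("network" order).
print "V: ", hex_bytes(pack("V", $length)), "\n";    # 32000000
print "N: ", hex_bytes(pack("N", $length)), "\n";    # 00000032

# Reading little-endian bytes with the wrong code gives a huge number.
print "wrong order: ", unpack("N", pack("V", $length)), "\n";    # 838860800
```

A wildly large value where a small length is expected is therefore a hint that the byte order, or the position of the field, is wrong.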

SQ1b: What are good candidate packing strings for $orf_name, $orf_direction, and $orf_descr?

Answer: Either a or A will work: a11 or A11 for $orf_name, a1 or A1 (or just a or A) for $orf_direction, and a50 or A50 for $orf_descr. At this stage we might have a slight preference for the lowercase a versions, since they return the exact string that is in the file. After we're sure we're on the right track it may be more convenient to use the uppercase A versions.
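The difference between the two codes shows up only in padded fields. A minimal sketch, assuming an 11-byte field in which a seven-character name is padded with blanks (the field contents here are typed in by hand, not read from the file):

```perl
use strict;
use warnings;

# An 11-byte field: a 7-character name followed by 4 blanks.
my $field = "all0001    ";

my ($exact)   = unpack("a11", $field);    # keeps the padding
my ($trimmed) = unpack("A11", $field);    # strips trailing blanks and nulls

print "a11: [$exact]\n";      # a11: [all0001    ]
print "A11: [$trimmed]\n";    # A11: [all0001]
```

While the layout is still uncertain, seeing the padding is useful, because it shows exactly which bytes the code consumed.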

SQ1c: What are good candidate packing strings for $orf_left and $orf_right?

Answer: The numbers are eight bytes long, so the only thing that fits is d, for a double-precision floating point number. That may be a bit of overkill for ORF co-ordinates, which are relatively small numbers (in the millions rather than billions or higher). But the program that produced this file apparently just uses one type of number, and double-precision floating point is the best all-around for that.

So one good answer is d. Another good answer is xd, which also accounts for the flag character that precedes the number.
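Here is a small sketch of how xd behaves on a nine-byte numeric field. The field is built by hand with pack; the flag character "F", the value 1234567, and the helper name read_flagged_double are all made up for the example.

```perl
use strict;
use warnings;

# Read a double that is preceded by a one-byte flag.
# x skips the flag; d reads a double in the machine's native format.
sub read_flagged_double {
    my ($field) = @_;
    my ($value) = unpack("xd", $field);
    return $value;
}

# Hypothetical field: flag byte "F" followed by an eight-byte double.
my $field = pack("a1d", "F", 1234567);

print "field length: ", length($field), "\n";    # 9 where a double is 8 bytes
print "value: ", read_flagged_double($field), "\n";    # value: 1234567
```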

SQ1d: What packing string should we use to replace ... ?

Here is a table giving the fields we want to read in (and some we don't care about), the variable or variables for that field, and the packing codes.

Field	First Variable	First Code	Second Variable	Second Code
OrfName	$name_length	V	$orf_name	a11
OrfContig	(skip)	x4	(skip)	x5
OrfLeft	(skip)	x1	$orf_left	d
OrfRight	(skip)	x1	$orf_right	d
OrfDirection	(skip)	x4	$orf_direction	a1
OrfAccession	(skip)	x4	(skip)	x14
OrfPct	(skip)	x4	(skip)	x5
OrfEval	(skip)	x1	(skip)	x8
OrfDescr	$descr_length	V	$orf_descr	a50

When we put all the packing codes together we get "Va11x4x5x1dx1dx4a1x4x14x4x5x1x8Va50". We could join adjacent x fields to make a shorter string: "Va11x10dx1dx4a1x36Va50".
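One way to check that the long and short strings describe the same layout is to count the bytes each one covers. The sketch below does that by packing dummy values with each template; the helper name template_length is an assumption for the example, and the count assumes a double is eight bytes on your machine.

```perl
use strict;
use warnings;

# Number of bytes a template covers, found by packing dummy values with it.
# Both templates take the same seven values in the same order
# (V, a11, d, d, a1, V, a50); x fields consume no values.
sub template_length {
    my ($template) = @_;
    return length(pack($template, 0, "", 0, 0, "", 0, ""));
}

my $long  = "Va11x4x5x1dx1dx4a1x4x14x4x5x1x8Va50";
my $short = "Va11x10dx1dx4a1x36Va50";

print "long:  ", template_length($long),  "\n";    # long:  137
print "short: ", template_length($short), "\n";    # short: 137
```

Both templates cover 137 bytes, which is the sum of the Length column for the nine fields listed in the table.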

Trying it out

Before we put this code into a larger program we should test it out to make sure it's working. Here's the skeleton of a program to read a binary file:

#!/usr/bin/perl -w
use strict;

dump_data("7120DB.DAT");

sub dump_data {
   my ($orf_file) = @_;
   my $record_length = 230;
   open ORF_DATA, "<$orf_file" or die "Can't open $orf_file: $!\n"; 
   binmode ORF_DATA;  # Tell Perl this isn't a text file
   my $buffer;
   while (read ORF_DATA, $buffer, $record_length) {
      # do something...
   }
   close ORF_DATA;
}

Two new things are the use of the binmode command, which lets Perl know that we intend to read binary data from the file, and the command read ORF_DATA, $buffer, $record_length. Binary files are usually organized into records, each record having exactly the same length. (By contrast, a text file may have lines of many different lengths.) The read command moves exactly $record_length bytes from the ORF_DATA file into $buffer, except at the very end of the file, where it moves whatever is left if fewer than $record_length bytes remain.
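The behaviour of read at the end of a file can be tried out without a real data file. This sketch reads from an in-memory filehandle holding 500 made-up bytes; the helper name chunk_lengths and the 500-byte size are assumptions chosen so that the last read comes up short.

```perl
use strict;
use warnings;

# Read a string of bytes in fixed-length chunks and report how many
# bytes each call to read() actually delivered.
sub chunk_lengths {
    my ($data, $record_length) = @_;
    open(my $fh, "<", \$data) or die "Can't open in-memory file: $!\n";
    binmode $fh;
    my @lengths;
    my $buffer;
    while (read($fh, $buffer, $record_length)) {
        push @lengths, length($buffer);
    }
    close $fh;
    return @lengths;
}

# 500 bytes: two full 230-byte records plus 40 bytes left over.
my @lengths = chunk_lengths("x" x 500, 230);
print "@lengths\n";    # 230 230 40
```

A cautious program could compare length($buffer) with $record_length before unpacking, so that a truncated final record is not mistaken for real data.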

If we replace # do something... with our unpacking code we get

   while (read ORF_DATA, $buffer, $record_length) {
      my ($name_length, $orf_name, $orf_left, $orf_right,
          $orf_direction, $descr_length, $orf_descr)
       = unpack("Va11x4x5x1dx1dx4a1x4x14x4x5x1x8Va50", $buffer);
      print "$name_length, $orf_name, $orf_left, $orf_right,
          $orf_direction, $descr_length, $orf_descr\n";
   }

Unfortunately, when we run this the result is pretty forbidding:

   22740, æS^@^@^Kall000, 1.46429023063736e-306, 2.35649294533797e-305,
             S, 4294967295, S^@^@2unknown protein                               
   4294967295, ES^@^@^Kall000, 1.74842164944403e-305, 1.40140740775937e-304,
             S, 4294967295, S^@^@2unknown protein                               
   4294967295, ES^@^@^Kasl000, 1.65915802715193e-306, 3.0355575411186e-304,
             S, 4294967295, S^@^@2unknown protein                               
   4294967295, ES^@^@^Karl550, 5.92969249363755e-307, 6.82232954380406e-307,
             S, 4294967295, S^@^@2ssrA: 10Sa RNA                                
   4294967295, ES^@^@^Kall000, 2.94279990811978e-305, 1.90416083759353e-308,
             S, 4294967295, S^@^@2AtpC: ATP synthase subunit gamma
   ...

Some of the characters in the strings are unprintable. (You may see slightly different output, depending on how your computer tries to print the unprintable.) The integers used for string length seem too big; the floating point numbers are tiny fractions.

What to do?

The easiest thing is to try big-endian integers instead of little-endian: "Na11x10dx1dx4a1x36Na50". (The d code always reads the machine's native format, so only the two length fields can change.) But that doesn't help:

   3562536960, æS^@^@^Kall000, 1.46429023063736e-306, 2.35649294533797e-305,
             S, 4294967295, S^@^@2unknown protein                               
   4294967295, ES^@^@^Kall000, 1.74842164944403e-305, 1.40140740775937e-304,
             S, 4294967295, S^@^@2unknown protein                               
   4294967295, ES^@^@^Kasl000, 1.65915802715193e-306, 3.0355575411186e-304,
             S, 4294967295, S^@^@2unknown protein                               
   4294967295, ES^@^@^Karl550, 5.92969249363755e-307, 6.82232954380406e-307,
             S, 4294967295, S^@^@2ssrA: 10Sa RNA                                
   4294967295, ES^@^@^Kall000, 2.94279990811978e-305, 1.90416083759353e-308,
             S, 4294967295, S^@^@2AtpC: ATP synthase subunit gamma
   ...
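The two runs are at least consistent with each other, which a short sketch can confirm. The byte values below are inferred from the printed numbers rather than read from the file, and the helper name hex_of is made up for the example.

```perl
use strict;
use warnings;

# Show raw bytes as hexadecimal digits.
sub hex_of {
    my ($bytes) = @_;
    return unpack("H*", $bytes);
}

# The four bytes that V turned into 22740 in the first run.
my $first = pack("V", 22740);
print hex_of($first), "\n";          # d4580000
print unpack("N", $first), "\n";     # 3562536960, as in the second run

# Four bytes of 0xFF read the same in either byte order.
my $all_ones = "\xff\xff\xff\xff";
print unpack("V", $all_ones), "\n";  # 4294967295
print unpack("N", $all_ones), "\n";  # 4294967295
```

So swapping the byte order only reshuffles the same four bytes. Neither reading looks like a string length between 0 and 50, which suggests the problem lies in where the fields start rather than in the byte order.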

We need to back off a little bit. Our format instructions don't seem to be working, so we need something that will give us a more neutral view of the file. Here's another snippet of code to replace # do something...:

      foreach my $i (0 .. length($buffer) - 1) {
         my $c = substr($buffer, $i, 1);
         my $d;
         if ($c =~ /[ a-zA-Z0-9.]/) { $d = $c }
         else { $d = "#" } 
         substr($buffer, $i, 1) = $d;
      }
      print "$buffer\n";

This prints the most common printable characters (blanks, letters, numbers, and periods) as themselves. We print a pound sign (#) in place of the unprintable characters.
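The same substitution as the loop above can also be written with Perl's tr/// operator, which works on the whole buffer at once. This is only an alternative sketch; the helper name neutral_view and the sample bytes are made up for the example.

```perl
use strict;
use warnings;

# Return a copy of $bytes in which every character outside the set
# (blank, letters, digits, period) is replaced by "#".
# The c flag complements the search list, so tr acts on everything
# that is NOT listed.
sub neutral_view {
    my ($bytes) = @_;
    (my $view = $bytes) =~ tr/ a-zA-Z0-9./#/c;
    return $view;
}

# Hypothetical sample: one control byte, some text, a null, and byte 0xFF.
print neutral_view("\x{0b}name 1.5\0\x{ff}"), "\n";    # #name 1.5##
```

Because it returns a copy, this version leaves $buffer itself unchanged, which matters if the same record is going to be unpacked afterwards.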

Here's the first line of output, broken into three lines at arbitrary places:

#X###S###all0001    S###C   N#sp#####N########S###cS###sp|Q06852|SLP1S### 50  N#
#######S##2unknown protein                                   S###PM# #S###NPun64
7.032N##3#####N##7 ####N1#####BCS###           N########N########N####

Still pretty ugly, eh? But there are some regularities. We definitely see some of the strings -- name and description stand out, among others.

We have a suggestion that numbers are preceded by a letter, maybe "S". That looks like something we might want to question.

SQ2a: What character signals the start of strings? What character signals the start of numbers?
SQ2b: Make a hypothesis about where each field in the table starts and ends, and write the start and end points in the printout above.

The answers are here.