Write-up on Tech Geek History: Perl, DBI, and Databases (revision 2)

Literature Review

DBI – The Database Interface for Perl 5

DBI, the Database Interface for Perl 5, is an ongoing effort to provide a robust, database-independent interface for database connectivity that abstracts the ‘guts’ of database technologies away from the programmer. With the explosion in popularity of Perl as the de rigueur language for CGI programming, a simple and standard connection interface to databases is imperative.

The Architecture of DBI

The architecture of DBI is an elegant one. We are channeled toward the solution by the very concept of the interface that we are trying to define. Furthermore, Perl, as usual, helps us along by providing powerful syntactic constructs and regular expressions that facilitate data processing on the scale required by large database applications.

The DBI Architecture

The DBI interface is the term used to describe both the interface specification – the methods used to build your programs – and the software modules that make this possible. We’ll first take a look at why you should be using DBI, and then describe the modular structure of DBI and its interaction with Perl. Finally, we’ll show you some sample DBI code.

Why DBI?

Database Independence

Database programming, as you will learn after you’ve programmed or administered a few different ones, is pretty much-of-a-muchness. Hence DBI. The fundamental processes involved in doing anything with a database are pretty similar right across the board: connecting to and disconnecting from the database, opening and closing cursors (a cursor is a construct identifying a particular set of rows in a database), and storing and retrieving data. The typical order in which this happens is:

  1. Load DBI
  2. Connect to database, loading the required driver (DBD)
  3. Open a cursor containing a SQL statement (pronounced “sequel”; SQL is the ANSI-standard Structured Query Language for querying and manipulating data within databases)
  4. Fetch the results
  5. Close the cursor
  6. Disconnect from the database
  7. Exit

Now, there are differences between each database engine, like datatypes of fields retrieved, precision, extra non-ANSI-compliant features and so on, but their basic operations are the same from system to system. DBI provides a unified layer so that people can write portable code while still allowing access to the non-standard features as well.

Once we’ve fetched some data from the database, what next? In general, DBI simply provides your data as scalar variables, which you can then manipulate as you would any other scalar.

Another feature is the ability to connect to more than one database simultaneously from within the same Perl program – even (dare we say it?) to databases from different vendors. You can connect to an Oracle and an mSQL and an Informix database at the same time. “What’s the point in that?” I hear you cry. Well, say you have a corporate Oracle database, and you want to display some of its data on the Web. Oracle via CGI can be slow due to Oracle’s internal login procedures, so you’ve decided to use mSQL as well.

Option number 1 is to write a program that runs SQL scripts and dumps the data to a flat file, perhaps comma-separated, which another program then reads and loads into the mSQL database. This is far too much like hard work.

Option number 2 is to write a single Perl script that connects simultaneously to both the Oracle and the mSQL databases, reads the data from Oracle, and puts it into mSQL transparently. Change Oracle to Informix and you won’t need to alter a thing.

Think about that the next time you’re writing two separate programs in C (a well-known excellent string and regular-expression-handling language) with different vendor precompilers on separate platforms. Then think on the paradise of cross-platform portability and cross-database connectivity. You know it makes sense.

DBI is a living organism and, coupled with Perl’s current popularity as a CGI scripting language and rapid-development tool, will become a more important factor in decisions to use Perl as a “serious” programming language. Similarities have been drawn between DBI and ODBC (the Open Database Connectivity API, which defines an API for accessing databases based on X/Open and ISO standards; ODBC took several years to spread from being Microsoft-specific to being shipped with Unix operating systems, such as Solaris 2.5, and is more complex than DBI!) that have led to questions of “Why do we need DBI?” DBI is simpler than ODBC. DBI will run immediately on more platforms than ODBC, except, rather perversely, the platform ODBC originally sprang from: Microsoft Windows. DBI is free. If this isn’t enough to convince you, DBI will be extended in the future to allow you to use off-the-shelf ODBC drivers anyway.

Databases vs. dbm

Databases, be they Object Database Management Systems (ODBMS) or the more common Relational Database Management Systems (RDBMS) are engineered to store and retrieve data. That’s their prime purpose. Unix was originally blessed with simple file-based “databases”, namely the dbm system. dbm lets you store data in files, and retrieve that data quickly. However, it also has two notable drawbacks:

File Locking. The dbm systems did not provide particularly robust file-locking capabilities, nor any mechanism for correcting problems arising from simultaneous writes.

Arbitrary Data Structures. The dbm systems allow only a single, fixed data structure: key-value pairs. The value could be a complex object, such as a struct, but the key had to be unique. This was a major limitation on the usefulness of dbm systems.

However, the dbm systems still provide a useful function for users with simple datasets and limited resources, since they’re fast, robust, and extremely well-tested. Perl modules to access dbm systems have now been integrated into the core Perl distribution via the AnyDBM_File module.

The Modules

Perl 5 has a powerful mechanism to ‘plug’ external modules into the Perl interpreter. This mechanism is realized by actually compiling and linking the module into the Perl interpreter, or by dynaloading (dynamically loading) the module into a running interpreter only when needed. This notion of separable modules is central to the philosophy of both Perl and DBI.

DBI essentially acts as a conduit for the DBD (Database Driver) modules. The DBDs all implement the methods defined in DBI (e.g., connect()), but in a database-specific way. To clarify this somewhat: since you, the DBI user, wish to use a completely database-independent programming layer, some part of the system must know how to execute the database-dependent code. That’s what the DBD does. The application programmer need never know the DBD is there! All they will be aware of is the database-independent methods defined by DBI. The DBD code is written and maintained by many volunteers and now covers a fairly broad base of database vendors, including Oracle, Informix, mSQL, Ingres, and Sybase. The Web pages listed at the end of this article contain more detailed and up-to-date information about the state of the different DBDs and the DBI as a whole.

Before we even start looking at how you use DBI, you need to download and install the modules. You will always need the DBI module itself, as well as one of the DBD modules for whichever database you have installed. You can download DBI and the DBD modules from the CPAN, details of which can be found on page 20. Please follow the instructions in the files prefixed with README. They are important. They are called README for a good reason. Your ability to get civil answers to your questions depends on familiarity with them!

Next, you need to inform the Perl interpreter that, hey, now would be a good time to load the DBI module. This must happen before any other DBI work, such as loading one of the database vendor-specific drivers or attempting to connect to a database. Now, this seems quite obvious, but if I had a dollar for each time….

#!/usr/bin/perl -w

use DBI;

And that’s it! Honest. If you don’t believe me, try this (assuming you’re running a Bourne-compatible shell):

$ PERL_DL_DEBUG=2 perl -e 'use DBI;'
DynaLoader.pm loaded (/usr/local/lib/perl5/i486-linux/5.003
/usr/local/lib/perl5 /usr/local/lib/perl5/site_perl/i486-linux
/usr/local/lib/perl5/site_perl . /usr/local/lib /usr/local/lib
/lib /usr/lib)
DynaLoader::bootstrap for DBI (auto/DBI/DBI.so)

See. That simple statement dynamically loads the shared library containing the DBI code into the interpreter, and imports the DBI interface methods, which means we can now start using DBI in anger.

If you refer back to the Architecture diagram on page 15, you can clearly see that all access to the databases is marshalled, or funnelled, through the DBI module. Therefore, we need to make Perl aware of our DBDs, and to do this, we need to use DBI.

To simplify matters, the DBI hides the details of loading the drivers. As you connect to a database with DBI->connect(), the DBI makes sure an appropriate driver is loaded and passes on the request to it. For example:

#!/usr/bin/perl -w

use DBI;

$dbh = DBI->connect( 'connection_string', 'username',
                     'password', 'mSQL' );

die "Can't connect to database: $DBI::errstr\n" unless $dbh;

loads the mSQL driver (since it’s not already loaded) and then attempts a connection to the specified database. This call returns a database handle, which we’ll see more of later. To use this method for other DBDs, simply change the fourth argument from mSQL to your database engine. If it is not specified, the value of the DBI_DRIVER environment variable will be used instead.

The connect method will croak() if the driver can’t be installed. Otherwise, it will return undef on any other error, and $DBI::errstr will contain an error message. The available_drivers() method returns a list of DBI drivers currently installed. Specific drivers can be loaded using the install_driver() method.

Handles

Handles are Perl objects returned by various DBI methods which the programmer can use to access data at various abstracted layers. The handles that are used by DBI are as follows and can be seen in the figure below.

Driver Handles. A Driver Handle, or drh, encapsulates the database driver itself. The driver handle does not connect you to a database, nor does it let you perform any database operations. It merely acts as a conduit between the DBI and the low-level database API calls. Generally you won’t need to deal with driver handles. The DBI does that.

Database Handles. A Database Handle, or dbh, encapsulates a single connection to a given database via a driver handle. There can be any number of database handles per driver handle. For example, if we have a script that copies data from one database to another where both databases are mSQL, then we will have one driver handle but two database handles. In our earlier scenario with an mSQL database and an Oracle database, we would have two driver handles, each with a single database handle.

Statement Handles. A Statement Handle, or sth, encapsulates a command issued to a database via a database handle. As with database handles, there can be any number of statement handles per database handle – or at least as many as the vendor permits. For example, if we have two tables in our database, one containing data and the other containing a stale copy of the data, and we have a program that refreshes the stale copy from the original, we could use two statement handles, one to SELECT the data from the first table, the second to UPDATE the data in the other table. These statement handles would operate asynchronously, if desired, and with the advent of multi-threading, this asynchronous behavior will become extremely powerful. Cursors are simply statement handles for SELECT statements.

Emulation Layers

Some of the most commonly asked questions on the comp.lang.perl.* newsgroups and DBI mailing lists are “I can’t get Oraperl to compile for Perl 5. It only seems to work with Perl 4. What do I do?” and “I’ve got all this Oraperl/Ingperl/isqlperl stuff and I want to upgrade to DBI, but I don’t want to recode it all. What can I do?”

Well, as per usual, we’re ahead of you. DBD::Oracle was released originally with an Oraperl emulation layer, a layer of software that translates the original Oraperl API calls into DBI method invocations. The upshot: your existing Oraperl code will now work transparently using DBI and DBD::Oracle, which means you can now start writing new code using the DBI interface, whilst either maintaining the Oraperl code, or migrating it to DBI.

The DBD::mSQL developers are about to release an Msqlperl emulation layer, and there is an isqlperl emulation layer being developed for DBD::Informix.

DBI Past and Future

The volunteer work to formulate a database independent interface for Perl started way back in September 1992, when it was known as DBperl. By the time a Perl 4 version of the specification was just about settled in early 1994, Perl 5 was arriving. Rather than rework the draft specification, an object oriented prototype DBI and DBD::Oracle driver were implemented and released in October 1994 by Tim Bunce. As the modules matured, many volunteers developed drivers for other databases.

The DBI is not without problems. Chief among them is a lack of current documentation. Work is underway to revise the DBperl specification (which, even though rather old, is still a detailed and interesting document). Another issue is the ‘narrow’ interface currently offered by the DBI. The current minimal functionality is actually intentional, at least in part. It will help to ensure that existing DBI code continues to work correctly when ODBC support is added. ODBC support is the next big item on the DBI agenda. The plan is to add ODBC support into the DBI so that ODBC drivers can be loaded and used in the same way, and at the same time, as the existing DBD drivers. The DBI will also adopt more ODBC/ISO standard conventions, such as the numeric values used to describe data types.

Everywhere you find data, you find databases. At the simplest level, every file can be considered a database. At the most complex level, expensive and complex relational database systems handle thousands of transactions per second. In between are countless improvised schemes for fast access to loosely structured data. Perl can work with all of them.

Early in the history of computers, people noticed that flat file databases don’t scale to large data sets. Flat files were tamed using fixed-length records or auxiliary indices, but updating became expensive, and previously simple applications bogged down with I/O overhead.

After some head-scratching, clever programmers devised a better solution. As hashes in memory provide more flexible access to data than do arrays, hashes on disk offer more convenient kinds of access than array-like text files. These benefits in access time cost you space, but disk space is cheap these days (or so the reasoning goes).

The DBM library gives Perl programmers a simple, easy-to-use database. You use the same standard operations on hashes bound to DBM files as you do on hashes in memory. In fact, that’s how you use DBM databases from Perl. You call dbmopen with the name of a hash and the filename holding the database. Then whenever you access the hash, Perl consults or changes the DBM database on disk.

Perl Cookbook
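The dbmopen mechanism described in the excerpt above can be sketched in a few lines. This is a minimal illustration, not taken from the original text; the filename "sitesdb" and the sample key are invented for the example.

```perl
#!/usr/bin/perl -w
use strict;

### Bind a hash to a DBM file on disk; "sitesdb" is a
### hypothetical filename and will be created if absent
my %db;
dbmopen( %db, "sitesdb", 0644 )
    or die "Can't open DBM file: $!\n";

### Ordinary hash operations are written through to disk
$db{"Stonehenge"} = "Wiltshire";
print "$db{'Stonehenge'}\n";

dbmclose( %db );
```

Because the hash is tied to the file, a later run of the program (or another program opening the same DBM file) will see the stored value without any explicit save step.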

A flat file is essentially a plain text file that does not contain any structured relationships with other files or databases. This type of file can be as simple as a text document, a CSV (Comma-Separated Values) file, an Excel spreadsheet, or even a binary file. A flat file database consists of just one table, with the size depending on the amount of data it needs to hold. An everyday example could be a list of names, addresses, and phone numbers—this basic data structure can be considered a flat file database.

The concept of flat file databases dates back to the earliest days of computing, arguably beginning with the punch card files created by Herman Hollerith in the 1890s for the US Census Bureau. These early databases did not feature relational aspects, such as indexing or linking of records. Despite the evolution of data storage and management technologies, flat file formats like CSV have remained popular due to their simplicity, efficiency, and ease of use.

Flat files are particularly valued for their ability to facilitate the import and export of data between different software applications or systems. They are stored in a simple text format where each line represents a distinct record, and fields within a record are separated by a specific delimiter, like a comma or tab. This simplicity ensures wide compatibility with various software applications and makes flat files an accessible and flexible option for data transfer tasks without requiring specialized software or database management systems.

Moreover, flat files are easy to create, edit, and customize using basic text editors, allowing users to adjust the structure and content of these files to meet specific data import or export requirements. The straightforward nature of flat files also contributes to their continued use in modern computing environments, from simple lists utilized in web interactions to complex data transfers between disparate systems.



Flat-File Databases

The simplest type of database that we can create and manipulate is the old standby, the flat-file database. This database is essentially a file, or group of files, that contains data in a known and standard format that a program scans for the requested information. Modifications to the data are usually done by updating an in-memory copy of the data held in the file, or files, then writing the entire set of data back out to disk. Flat-file databases are typically ASCII text files containing one record of information per line. The line termination serves as the record delimiter.

In this section we’ll be examining the two main types of flat-file database: files that separate fields with a delimiter character, and files that allocate a fixed length to each field. We’ll discuss the pros and cons of each type of data file and give you some example code for manipulating them.

The most common format used for flat-file databases is probably the delimited file in which each field is separated by a delimiting character. And possibly the most common of these delimited formats is the comma-separated values (CSV) file, in which fields are separated from one another by commas. This format is understood by many common programs, such as Microsoft Access and spreadsheet programs. As such, it is an excellent base-level and portable format useful for sharing data between applications.[8]

[8]More excitingly, a DBI driver called DBD::CSV exists that allows you to write SQL code to manipulate a flat file containing CSV data.

Other popular delimiting characters are the colon ( : ), the tab, and the pipe symbol ( | ). The Unix /etc/passwd file is a good example of a delimited file with each record being separated by a colon. Figure 2-1 shows a single record from an /etc/passwd file.

2.4.1. Querying Data

Since delimited files are a very low-level form of storage manager, any manipulations that we wish to perform on the data must be done using operating system functions and low-level query logic, such as basic string comparisons. The following program illustrates how we can open a data file containing colon-separated records of megalith data, search for a given site, and return the data if found:
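The original listing is not reproduced in this excerpt. A minimal sketch along the lines described might look like the following; the data file name, sample records, and output format are invented for the illustration, and the field layout is borrowed from the later examples.

```perl
#!/usr/bin/perl -w
use strict;

my $megalithFile = "megaliths.dat";   # hypothetical data file

### Create a small colon-delimited data file to search
open MEGADATA, ">$megalithFile"
    or die "Can't create $megalithFile: $!\n";
print MEGADATA "Stonehenge:Wiltshire:SU 123 400:Stone Circle and Henge:Famous\n";
print MEGADATA "Avebury:Wiltshire:SU 103 700:Stone Circle and Henge:Large\n";
close MEGADATA;

### Scan the file sequentially for the requested site
my $siteName = "Avebury";
my $found;

open MEGADATA, "<$megalithFile"
    or die "Can't open $megalithFile: $!\n";
while ( <MEGADATA> ) {
    chomp;
    ### Break the record up into fields on the colon delimiter
    my ( $name, $location, $mapref, $type, $description ) = split /:/;
    next unless $name eq $siteName;
    $found = "$name is a $type at $location (map ref $mapref)";
    last;
}
close MEGADATA;

print $found ? "$found\n" : "Site $siteName not found\n";
unlink $megalithFile;
```

Note that the search is a plain string comparison against the first field, exactly the "low-level query logic" the text describes.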

The downside to delimited file formats is that if any piece of data contains the delimiting character, you need to be especially careful not to break up the records in the wrong place. Using the Perl split() function with a simple regular expression, as used above, does not take this into account and could produce wrong results. For example, a record containing the following information would cause the split() to happen in the wrong place:

Stonehenge:Wiltshire:SU 123 400:Stone Circle and Henge:Stonehenge: The most famous stone circle

The easiest quick-fix technique is to translate any delimiter characters in the string into some other character that you’re sure won’t appear in your data. Don’t forget to do the reverse translation when you fetch the records back.
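The quick-fix translation can be sketched with tr///. Here the semicolon is the stand-in character, which is an arbitrary choice for the example; in practice you would pick any character you are sure never appears in your data.

```perl
#!/usr/bin/perl -w
use strict;

my $description = "Stonehenge: The most famous stone circle";

### Translate the delimiter away before writing the record...
( my $safe = $description ) =~ tr/:/;/;
my $record = join( ":", "Stonehenge", "Wiltshire", $safe );

### ...and translate it back when fetching the record
my ( $name, $county, $desc ) = split /:/, $record;
$desc =~ tr/;/:/;

print "$desc\n";
```

The round trip restores the original text, at the cost of being unable to store a genuine semicolon in the data.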

Another common way of storing data within flat files is to use fixed-length records in which to store the data. That is, each piece of data fits into an exactly sized space in the data file. In this form of database, no delimiting character is needed between the fields. There’s also no need to delimit each record, but we’ll continue to use ASCII line termination as a record delimiter in our examples because Perl makes it very easy to work with files line by line.

Using fixed-width fields is similar to the way in which data is organized in more powerful database systems such as an RDBMS. The pre-allocation of space for record data allows the storage manager to make assumptions about the layout of the data on disk and to optimize accordingly. For our megalithic data purposes, we could settle on the data sizes of:[10]

[10]The fact that these data sizes are all powers of two has no significance other than to indicate that the authors are old enough to remember when powers of two were significant and useful sometimes. They generally aren’t anymore.

Field           Required Bytes
-----           --------------
Name            64
Location        64
Map Reference   16
Type            32
Description     256

Storing the data in this format requires slightly different storage manager logic to be used, although the standard Perl file I/O functions are still applicable. To test this data for the correct record, we need to implement a different way of extracting the fields from within each record. For a fixed-length data file, the Perl function unpack() is perfect. The following code shows how the unpack() function replaces the split() used above:

### Break up the record data into separate fields
### using the data sizes listed above
( $name, $location, $mapref, $type, $description ) =
    unpack( "A64 A64 A16 A32 A256", $_ );

Although fixed-length fields are always the same length, the data that is being put into a particular field may not be as long as the field. In this case, the extra space will be filled with a character not normally encountered in the data or one that can be ignored. Usually, this is a space character (ASCII 32) or a nul (ASCII 0).

In the code above, we know that the data is space-packed, and so we remove any trailing space from the name record so as not to confuse the search. This can be simply done by using the uppercase A format with unpack().
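The space-padding behaviour can be demonstrated in isolation. The "A" template pads with spaces when packing and strips trailing spaces when unpacking, which is why no explicit trimming is needed:

```perl
#!/usr/bin/perl -w
use strict;

### Pack a short value into a 64-byte field
my $packed = pack( "A64", "Stonehenge" );
print length( $packed ), "\n";      # the field is always 64 bytes

### Unpack it again; the trailing padding is stripped
my $name = unpack( "A64", $packed );
print "[$name]\n";
```

If the padding must be preserved (for example, to distinguish a value that genuinely ends in spaces), the lowercase "a" template can be used instead, which leaves the field untouched.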

If you need to choose between delimited fields and fixed-length fields, here are a few guidelines:

The main limitations

The main limitation with delimited fields is the need to add special handling to ensure that neither the field delimiter nor the record delimiter characters get added into a field value.

The main limitation with fixed-length fields is simply the fixed length. You need to check for field values being too long to fit (or just let them be silently truncated). If you need to increase a field width, then you’ll have to write a special utility to rewrite your file in the new format and remember to track down and update every script that manipulates the file directly.

Space

A delimited-field file often uses less space than a fixed-length record file to store the same data, sometimes very much less space. It depends on the number and size of any empty or partially filled fields. For example, some field values, like web URLs, are potentially very long but typically very short. Storing them in a long fixed-length field would waste a lot of space.

While delimited-field files often use less space, they do “waste” space due to all the field delimiter characters. If you’re storing a large number of very small fields then that might tip the balance in favor of fixed-length records.

Speed

These days, computing power is rising faster than hard disk data transfer rates. In other words, it’s often worth using more space-efficient storage even if that means spending more processor time to use it.

Generally, delimited-field files are better for sequential access than fixed-length record files because the reduced size more than makes up for the increase in processing to extract the fields and handle any escaped or translated delimiter characters.

However, fixed-length record files do have a trick up their sleeve: direct access. If you want to fetch record 42,927 of a delimited-field file, you have to read the whole file and count records until you get to the one you want. With a fixed-length record file, you can just multiply the record number (counting from zero) by the total record width and jump directly to the record using seek().

Furthermore, once it’s located, the record can be updated in-place by overwriting it with new data. Because the new record is the same length as the old, there’s no danger of corrupting the following record.
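The seek-and-overwrite technique can be sketched as follows, using the field widths from the table above (432 data bytes plus one for the newline). The file name and sample records are invented for the example:

```perl
#!/usr/bin/perl -w
use strict;

my $recordLength = 64 + 64 + 16 + 32 + 256 + 1;   # +1 for the newline

### Build a small demonstration file of three fixed-length records
my $file = "megaliths.fixed";
open my $fh, "+>", $file or die "Can't open $file: $!\n";
for my $site ( "Stonehenge", "Avebury", "Callanish" ) {
    print $fh pack( "A64 A64 A16 A32 A256", $site, "Somewhere",
                    "XX 000 000", "Stone Circle", "A ring of stones" ), "\n";
}

### Jump straight to the map reference field of record 1 (the second
### record) without scanning: offset = record number * record length,
### plus 128 bytes to skip the Name and Location fields
my $recno = 1;
seek $fh, $recno * $recordLength + 128, 0 or die "Can't seek: $!\n";

### Overwrite the 16-byte map reference field in-place
print $fh pack( "A16", "SU 103 700" );

### Re-read the whole record to show the update took effect
seek $fh, $recno * $recordLength, 0 or die "Can't seek: $!\n";
read $fh, my $record, $recordLength;
my ( $name, $location, $mapref ) = unpack( "A64 A64 A16", $record );
print "$name: $mapref\n";

close $fh;
unlink $file;
```

Because the replacement field is packed to exactly 16 bytes, the surrounding fields and the following record are untouched, which is precisely the safety property the text describes.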

2.4.2. Inserting Data

Inserting data into a flat-file database is very straightforward and usually amounts to simply tacking the new data onto the end of the data file. For example, inserting a new megalith record into a colon-delimited file can be expressed as simply as:

#!/usr/bin/perl -w
#
# ch02/insertmegadata/insertmegadata: Inserts a new record into the
#                                     given megalith data file as
#                                     colon-separated data
#

### Check the user has supplied an argument to scan for
###     1) The name of the file containing the data
###     2) The name of the site to insert the data for
###     3) The location of the site
###     4) The map reference of the site
###     5) The type of site
###     6) The description of the site
die "Usage: insertmegadata"
    ." <data file> <site name> <location> <map reference> <type> <description>\n"
    unless @ARGV == 6;

my $megalithFile    = $ARGV[0];
my $siteName        = $ARGV[1];
my $siteLocation    = $ARGV[2];
my $siteMapRef      = $ARGV[3];
my $siteType        = $ARGV[4];
my $siteDescription = $ARGV[5];

### Open the data file for appending, and die upon failure
open MEGADATA, ">>$megalithFile"
    or die "Can't open $megalithFile for appending: $!\n";

### Create a new record
my $record = join( ":", $siteName, $siteLocation, $siteMapRef,
                        $siteType, $siteDescription );

### Insert the new record into the file
print MEGADATA "$record\n"
    or die "Error writing to $megalithFile: $!\n";

### Close the megalith data file
close MEGADATA
    or die "Error closing $megalithFile: $!";

print "Inserted record for $siteName\n";

exit;

This example simply opens the data file in append mode and writes the new record to the open file. Simple as this process is, there is a potential drawback. This flat-file database does not detect the insertion of multiple items of data with the same search key. That is, if we wanted to insert a new record about Stonehenge into our megalith database, then the software would happily do so, even though a record for Stonehenge already exists.

This may be a problem from a data integrity point of view. A more sophisticated test prior to appending the data might be worth implementing to ensure that duplicate records do not exist. Combining the insert program with the query program above is a straightforward approach.

Another potential (and more important) drawback is that this system will not safely handle occasions in which more than one user attempts to add new data into the database. Since this subject also affects updating and deleting data from the database, we’ll cover it more thoroughly in a later section of this chapter.

Inserting new records into a fixed-length data file is also simple. Instead of printing each field to the Perl file handle separated by the delimiting character, we can use the pack() function to create a fixed-length record out of the data.
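The fixed-length variant of the insert can be sketched in a few lines, reusing the field widths from the earlier table. The file name and record data here are invented for the illustration:

```perl
#!/usr/bin/perl -w
use strict;

my @fields = ( "Lundin Links", "Fife", "NO 404 027",
               "Standing Stones", "Three massive stones" );

### Pack the fields into a single fixed-length record (432 bytes)
my $record = pack( "A64 A64 A16 A32 A256", @fields );

### Append the record to the data file, as in the delimited example
open MEGADATA, ">>megaliths.fixed"
    or die "Can't open megaliths.fixed for appending: $!\n";
print MEGADATA "$record\n"
    or die "Error writing: $!\n";
close MEGADATA
    or die "Error closing: $!\n";
```

The only change from the delimited version is that pack() replaces join(); everything else, including the append-mode open, stays the same.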

2.4.3. Updating Data

Updating data within a flat-file database is where things begin to get a little more tricky. When querying records from the database, we simply scanned sequentially through the database until we found the correct record. Similarly, when inserting data, we simply attached the new data without really knowing what was already stored within the database.

The main problem with updating data is that we need to be able to read in data from the data file, temporarily mess about with it, and write the database back out to the file without losing any records.

One approach is to slurp the entire database into memory, make any updates to the in-memory copy, and dump it all back out again. A second approach is to read the database in record by record, make any alterations to each individual record, and write the record immediately back out to a temporary file. Once all the records have been processed, the temporary file can replace the original data file. Both techniques are viable, but we prefer the latter for performance reasons. Slurping entire large databases into memory can be very resource-hungry.

The following short program implements the latter of these strategies to update the map reference in the database of delimited records:

#!/usr/bin/perl -w
#
# ch02/updatemegadata/updatemegadata: Updates the given megalith data file
#                                     for a given site. Uses colon-separated
#                                     data and updates the map reference field.
#

### Check the user has supplied an argument to scan for
###     1) The name of the file containing the data
###     2) The name of the site to search for
###     3) The new map reference
die "Usage: updatemegadata <data file> <site name> <new map reference>\n"
    unless @ARGV == 3;

my $megalithFile = $ARGV[0];
my $siteName     = $ARGV[1];
my $siteMapRef   = $ARGV[2];
my $tempFile     = "tmp.$$";

### Open the data file for reading, and die upon failure
open MEGADATA, "<$megalithFile"
    or die "Can't open $megalithFile: $!\n";

### Open the temporary megalith data file for writing
open TMPMEGADATA, ">$tempFile"
    or die "Can't open temporary file $tempFile: $!\n";

### Scan through all the records looking for the desired site
while ( <MEGADATA> ) {

    ### Quick pre-check for maximum performance:
    ### skip the record if the site name doesn't appear as the first field
    next unless m/^\Q$siteName\E:/;

    ### Break up the record data into separate fields
    ### (we let $description carry the newline for us)
    my ( $name, $location, $mapref, $type, $description ) =
        split( /:/, $_ );

    ### Skip the record if the extracted site name field doesn't match
    ### exactly (redundant after the pre-check above, but kept for
    ### consistency with the other examples)
    next unless $siteName eq $name;

    ### We've found the record to update, so update the map ref value
    $mapref = $siteMapRef;

    ### Construct an updated record
    $_ = join( ":", $name, $location, $mapref, $type, $description );
}
continue {
    ### Write the record out to the temporary file
    print TMPMEGADATA $_
        or die "Error writing $tempFile: $!\n";
}

### Close the megalith input data file
close MEGADATA;

### Close the temporary megalith output data file
close TMPMEGADATA
    or die "Error closing $tempFile: $!\n";

### We now "commit" the changes by deleting the old file...
unlink $megalithFile
    or die "Can't delete old $megalithFile: $!\n";

### and renaming the new file to replace the old one.
rename $tempFile, $megalithFile
    or die "Can't rename '$tempFile' to '$megalithFile': $!\n";

exit 0;

You can see we’ve flexed our Perl muscles on this example, using a while … continue loop to simplify the logic and adding a pretest for increased speed.

An equivalent program that can be applied to a fixed-length file is very similar, except that we use a faster in-place update to change the contents of the field. This principle is similar to the in-place query described previously: we don’t need to unpack and repack all the fields stored within each record, but can simply update the appropriate chunk of each record. For example:

### Scan through all the records looking for the desired site
while ( <MEGADATA> ) {

    ### Quick pre-check for maximum performance:
    ### Skip the record if the site name doesn't appear at the start
    next unless m/^\Q$siteName/;

    ### Skip the record if the extracted site name field doesn't match
    next unless unpack( "A64", $_ ) eq $siteName;

    ### Perform in-place substitution to update the map reference field
    substr( $_, 64+64, 16 ) = pack( "A16", $siteMapRef );
}

This technique is faster than packing and unpacking each record stored within the file, since it carries out the minimum amount of work needed to change the appropriate field values.

You may notice that the pretest in this example isn’t 100% reliable, but it doesn’t have to be. It just needs to catch most of the cases that won’t match in order to pay its way by reducing the number of times the more expensive unpack and field test gets executed. Okay, this might not be a very convincing application of the idea, but we’ll revisit it more seriously later in this chapter.
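To make the in-place idea concrete in isolation, here is a tiny self-contained sketch. The 64/64/16-byte field widths follow the fixed-length layout used in this chapter, while the sample record content is invented:

```perl
#!/usr/bin/perl -w
# Self-contained sketch of an lvalue-substr field update on one fixed-width
# record. Sample content is invented; widths follow the chapter's layout:
# name A64, location A64, map reference A16.
use strict;

### Build one fixed-width record: name(64), location(64), mapref(16)
my $record = pack( "A64 A64 A16", "Stonehenge", "Wiltshire", "SU 123 422" );

### Patch just the map reference field, bytes 128..143, in place;
### substr() on the left of an assignment is a writable "window" into $record
substr( $record, 64+64, 16 ) = pack( "A16", "SU 999 999" );

### The other fields are untouched; only the 16-byte chunk changed
my ( $name, $location, $mapref ) = unpack( "A64 A64 A16", $record );
print "$name / $location / $mapref\n";    # prints "Stonehenge / Wiltshire / SU 999 999"
```

Because the replacement string packed with "A16" is exactly 16 bytes, the record length never changes, which is what makes this safe for fixed-width files.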

2.4.4. Deleting Data

The final form of data manipulation that you can apply to flat-file databases is the removal, or deletion, of records from the database. We shall process the file a record at a time by passing the data through a temporary file, just as we did for updating, rather than slurping all the data into memory and dumping it at the end.

With this technique, the action of removing a record from the database is more an act of omission than any actual deletion. Each record is read in from the file, tested, and written out to the file. When the record to be deleted is encountered, it is simply not written to the temporary file. This effectively removes all trace of it from the database, albeit in a rather unsophisticated way.

The following program can be used to remove the relevant record from the delimited megalithic database when given an argument of the name of the site to delete:

#!/usr/bin/perl -w
#
# ch02/deletemegadata/deletemegadata: Deletes the record for the given
#                                     megalithic site. Uses
#                                     colon-separated data
#

### Check the user has supplied two arguments:
###     1) The name of the file containing the data
###     2) The name of the site to delete
die "Usage: deletemegadata <data file> <site name>\n"
    unless @ARGV == 2;

my $megalithFile = $ARGV[0];
my $siteName     = $ARGV[1];
my $tempFile     = "tmp.$$";

### Open the data file for reading, and die upon failure
open MEGADATA, "<$megalithFile"
    or die "Can't open $megalithFile: $!\n";

### Open the temporary megalith data file for writing
open TMPMEGADATA, ">$tempFile"
    or die "Can't open temporary file $tempFile: $!\n";

### Scan through all the entries for the desired site
while ( <MEGADATA> ) {

    ### Extract the site name (the first field) from the record
    my ( $name ) = split( /:/, $_ );

    ### Test the site name against the record's name
    if ( $siteName eq $name ) {

        ### We've found the record to delete, so skip it and move on
        next;
    }

    ### Write the original record out to the temporary file
    print TMPMEGADATA $_
        or die "Error writing $tempFile: $!\n";
}

### Close the megalith input data file
close MEGADATA;

### Close the temporary megalith output data file
close TMPMEGADATA
    or die "Error closing $tempFile: $!\n";

### We now "commit" the changes by deleting the old file ...
unlink $megalithFile
    or die "Can't delete old $megalithFile: $!\n";

### and renaming the new file to replace the old one.
rename $tempFile, $megalithFile
    or die "Can't rename '$tempFile' to '$megalithFile': $!\n";

exit 0;

The code to remove records from a fixed-length data file is almost identical. The only change is in the code to extract the field value, as you’d expect:

### Extract the site name (the first field) from the record
my ( $name ) = unpack( "A64", $_ );

2.5.3. Summary of Flat-File Databases

The main benefit of using flat-file databases for data storage is that they can be fast to implement and fast to use on small and straightforward datasets, such as our megalithic database or a Unix password file.

The code to query, insert, delete, and update information in the database is also extremely simple, with the parsing code potentially shared among the operations. You have total control over the data file formats, so that there are no situations outside your control in which the file format or access API changes. The files are also easy to read in standard text editors (although in the case of the Storable example, they won’t make very interesting reading).

The downsides of these databases are quite apparent. As we’ve mentioned already, the lack of concurrent access limits the power of such systems in a multi-user environment. They also suffer from scalability problems due to the sequential nature of the search mechanism. These limitations can be coded around (the concurrent access problem especially so), but there comes a point where you should seriously consider the use of a higher-level storage manager such as DBM files. DBM files also give you access to indexed data and allow nonsequential querying.
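As a taste of what DBM files offer, keyed (nonsequential) lookup can be as simple as tying a hash to a disk file. This sketch is illustrative only: the file name and data are invented, and it uses SDBM_File, which ships with Perl, rather than any particular external storage manager:

```perl
#!/usr/bin/perl -w
# Minimal sketch of keyed lookup via a DBM file. File name and data are
# invented; SDBM_File is used because it ships with Perl itself.
use strict;
use Fcntl;
use SDBM_File;

my $dbfile = "megaliths_dbm_$$";

### Tie a hash to a DBM file: hash operations become disk operations
my %megaliths;
tie %megaliths, 'SDBM_File', $dbfile, O_RDWR|O_CREAT, 0666
    or die "Can't tie DBM file $dbfile: $!\n";

### Store values keyed by site name
$megaliths{"Stonehenge"} = "Wiltshire:SU 123 422:Stone Circle";
$megaliths{"Avebury"}    = "Wiltshire:SU 103 700:Stone Circle and Henge";

### Nonsequential query: no scan through the whole file required
my $found = $megaliths{"Avebury"};
print "Avebury => $found\n";

untie %megaliths;
unlink "$dbfile.pag", "$dbfile.dir";    # tidy up the files SDBM creates
```

Contrast this with the flat-file programs above, where finding one record means reading every record before it.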

Before we discuss DBM files in detail, the following sections give you examples of more sophisticated management tools and techniques, as well as a method of handling concurrent users.

Putting Complex Data into Flat Files (Programming the Perl DBI)

3.1  Record Types in Perl

Reading simple record-oriented data

Perl makes it very easy to deal with record-oriented data, particularly simple records of the type we are discussing here. We have seen before the idiom where you can read a file a line at a time using a construct like:

while (<FILE>) {
    chomp;    # remove newline
    # each line in turn is assigned to $_
}

The most important part of the construct is the use of <FILE> to read data from the file handle FILE, which has presumably been assigned to a file earlier in the program by a call to the open function. This file input operator can return two different results, depending on whether it is used in scalar context or array context. When called in a scalar context, the file input operator returns the next record from the file handle. This begs the obvious question of what constitutes a record. The answer is that input records are separated by a sequence of characters called (logically enough) the input record separator. This value is stored in the variable $/.

The default value is a newline \n (which is translated to the appropriate actual characters for the specific operating system), but this can be altered to any other string of characters. We will look at this usage in more detail later, but for now the default value will suffice. When called in an array context, the file input operator returns a list in which each element is a record from the input file. You can, therefore, call the file input operator in one of these two ways:

my $next_line  = <FILE>;

my @whole_file = <FILE>;

In both of these examples it is important to realize that each record, whether it is the record stored in $next_line or one of the records in @whole_file, will still contain the value of $/ at the end. Often you will want to get rid of this, and the easiest way to do it is by using the chomp function. chomp is passed either a scalar or an array and removes the value of $/ from the end of the scalar or from each element of the array. If no argument is passed to chomp, then it works on $_.

Reading data a record at a time (from first principles)

Now that we understand a little more about the file input operator and chomp, let's see if we can build our standard data munging input construct from first principles. A first attempt at processing each line in a file might look something like this:

my $line;
while ($line = <FILE>) {
    chomp $line;
    ...
}
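The interplay of $/ and chomp can be demonstrated without a real file at all. This self-contained sketch uses invented data and a filehandle opened on an in-memory string (a convenience of Perl 5.8 and later), and also shows what happens when $/ is changed from the default newline:

```perl
#!/usr/bin/perl -w
# Sketch of $/ and chomp behaviour; sample data is invented, and the
# filehandles are opened on in-memory strings (Perl 5.8+) so no real
# file is needed.
use strict;

my $data = "first line\nsecond line\n";
open my $fh, "<", \$data or die "Can't open in-memory file: $!\n";

my $next_line = <$fh>;                       # "first line\n": still carries $/
my $had_sep   = ( $next_line =~ /\n\z/ ) ? 1 : 0;

chomp $next_line;                            # removes the value of $/ from the end
print "[$next_line]\n";                      # prints "[first line]"

### chomp also works on every element of an array
open my $fh2, "<", \$data or die "Can't reopen: $!\n";
my @whole_file = <$fh2>;
chomp @whole_file;                           # now ("first line", "second line")

### Changing $/ changes what constitutes a record
{
    local $/ = "::";                         # records now end in "::"
    open my $fh3, "<", \"one::two::three" or die "Can't open: $!\n";
    my $rec = <$fh3>;                        # reads "one::"
    chomp $rec;                              # removes "::", leaving "one"
    print "[$rec]\n";                        # prints "[one]"
}
```

Note that chomp removes exactly the current value of $/, not "any trailing whitespace"; that distinction is what makes the custom-separator case work.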

Fixed-width data is becoming less common, but it is still possible that you will come across it, particularly if you are exchanging data with an older computer system that runs on a mainframe or is written in COBOL.

3.2  Reading Fixed-Width Data

In a fixed-width data record, there is nothing to distinguish one data item from the next one. Each data item is simply written immediately after the preceding one, after padding it to a defined width. This padding can either be with zeroes or with spaces and can be before or after the data. In order to interpret the data, we need more information about the way it has been written. This is normally sent separately from the data itself but, as we shall see later, it is also possible to encode this data within the files. Here is an example of two fixed-width data records:

00374Bloggs & Co         19991105100103+00015000
00375Smith Brothers      19991106001234-00004999

Multiple record types

One slight variation of the fixed-width data record has different sets of data fields for different types of data within the same file. Consider a system that maintains a product list and, at the end of each day, produces a file that lists all new products added and old products deleted during the day. For a new product, you will need a lot of data (perhaps product code, description, price, stock count, and supplier identifier). For the deleted product you only need the product code (but you might also list the product description to make the report easier to follow).

Each record will have some kind of identifier at the start of the line denoting which kind of record it is. In our example they will be the strings ADD and DEL. Here are some sample data:

ADD0101Super Widget        00999901000SUPP01
ADD0102Ultra Widget        01499900500SUPP01
DEL0050Basic Widget
DEL0051Cheap Widget

On the day covered by this data, we have added two new widgets to our product catalogue. The Super Widget (product code 0101) costs $99.99 and we have 1000 in stock. The Ultra Widget (product code 0102) costs $149.99 and we have 500 in stock. We purchase both new widgets from the same supplier. At the same time we have discontinued two older products, the Basic Widget (Product Code 0050) and the Cheap Widget (Product Code 0051).
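A dispatch on the record tag can then drive the right unpack template for each record type. The field widths in this sketch are assumptions inferred from the sample data above (3-character tag, 4-digit code, 20-character description, then 6-digit price in cents, 5-digit stock count, and 6-character supplier for ADD records):

```perl
#!/usr/bin/perl -w
# Sketch of dispatching on multiple fixed-width record types. Field widths
# are assumptions inferred from the sample data: tag A3, code A4,
# description A20, then price A6, stock A5, supplier A6 for ADD records.
use strict;

my @lines = (
    # Build the ADD record with sprintf so the padding is exact
    sprintf( "%3s%4s%-20s%06d%05d%6s",
             "ADD", "0101", "Super Widget", 9999, 1000, "SUPP01" ),
    "DEL0050Basic Widget",
);

foreach ( @lines ) {
    my $tag = substr( $_, 0, 3 );    # the record-type identifier
    if ( $tag eq "ADD" ) {
        ### "x3" skips the tag; the rest matches the ADD layout
        my ( $code, $desc, $price, $stock, $supp ) =
            unpack( "x3 A4 A20 A6 A5 A6", $_ );
        printf "Added %s (%s): \$%.2f, %d in stock, from %s\n",
               $desc, $code, $price / 100, $stock, $supp;
    }
    elsif ( $tag eq "DEL" ) {
        ### DEL records carry only the code and (optionally) the description
        my ( $code, $desc ) = unpack( "x3 A4 A20", $_ );
        print "Deleted $desc ($code)\n";
    }
}
```

The "A" template conveniently strips the trailing padding spaces for us, and unpack is forgiving when a short DEL record does not fill the full 20-character description field.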

Data with no end-of-record marker

Another difference that you may come across with fixed-width data is that sometimes it comes without a defined end-of-record marker. As both the size of each field in a record and the number of fields in a record are well defined, we know how long each record will be. It is, therefore, possible to send the data as a stream of bytes and leave it up to the receiving program to split the data into individual records. Perl, of course, has a number of ways to do this. You could read the whole file into memory and split the data using substr or unpack, but for many tasks the amount of data to process makes this unfeasible. The most efficient way is to use a completely different method of reading your data. In addition to the <FILE> syntax that reads data from file handles one record at a time, Perl supports a more traditional syntax using the read and seek functions.

The read function takes three or four arguments. The first three are: a file handle to read data from, a scalar variable to read the data into, and the maximum number of bytes to read. The fourth, optional, argument is an offset into the variable where you want to start writing the data (this is rarely used). read returns the number of bytes read (which can be less than the requested number if you are near the end of a file) and zero when there is no more data to read.

Each open file handle has a current position associated with it called a file pointer, and read takes its data from the file pointer and moves the pointer to the end of the data it has read. You can also reposition the file pointer explicitly using the seek function. seek takes three arguments: the file handle, the offset you wish to move to, and a value that indicates how the offset should be interpreted. If this value is 0, the offset is calculated from the start of the file; if it is 1, the offset is calculated from the current position; and if it is 2, the offset is calculated from the end of the file. You can always find out the current position of the file pointer by using tell, which returns the offset from the start of the file in bytes. seek and tell are often unnecessary when handling ASCII fixed-width data files, as you usually just read the file in sequential order.

Example: reading data with no end-of-record markers using read

As an example, if our previous data file were written without newlines, we could use code like this to read it (obviously we could use any of the previously discussed techniques to split the record up once we have read it):

my $template = 'A5A20A8A6AA8';    # 5+20+8+6+1+8 = 48 bytes per record

my $data;

while (read STDIN, $data, 48) {
    my @rec = unpack($template, $data);
    # ... process the fields in @rec here ...
}
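Because every record is exactly 48 bytes, seek can jump straight to any record without reading the ones before it. This self-contained sketch (file name and data invented, using the same 48-byte layout as above) writes two marker-less records to a temporary file and then reads the second one directly:

```perl
#!/usr/bin/perl -w
# Sketch of random access into fixed-width records with seek, read, and
# tell. File name and data are invented; the 48-byte record layout matches
# the example above.
use strict;

my $recLen   = 48;
my $template = 'A5A20A8A6AA8';
my $file     = "fixed.tmp.$$";

### Write two 48-byte records with no end-of-record markers
open OUT, ">$file" or die "Can't create $file: $!\n";
print OUT pack( $template, "00374", "Bloggs & Co",    "19991105", "100103", "+", "00015000" ),
          pack( $template, "00375", "Smith Brothers", "19991106", "001234", "-", "00004999" );
close OUT;

### Jump straight to record 1 (the second record) and read it
open IN, "<$file" or die "Can't open $file: $!\n";
seek IN, 1 * $recLen, 0 or die "Can't seek: $!\n";   # whence 0: offset from start
read IN, my $data, $recLen;
print "Now at byte ", tell(IN), "\n";                # prints "Now at byte 96"

my @rec = unpack( $template, $data );
print "$rec[0] $rec[1]\n";                           # prints "00375 Smith Brothers"

close IN;
unlink $file;    # tidy up the sample file
```

Generalizing, record N lives at byte offset N * $recLen, so any record can be fetched (or patched in place) in constant time.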

https://perlhacks.com/dmp.pdf

References

Descartes, Alligator, and Tim Bunce. Programming the Perl DBI. O'Reilly & Associates, Inc., United States, 1 January 2000.

"Putting Complex Data into Flat Files." Programming the Perl DBI.

"What is a Flat File, or Flat File Database?"

"DBI – The Database Interface for Perl 5." The Perl Journal, Spring 1997.

"Transactions, Locking, and Isolation." Programming the Perl DBI.
