Write up on Tech Geek History: CODASYL (Revision 2)

Literature Review

Definition

CODASYL, which stands for Conference on Data Systems Languages, was a standards organization that played a significant role in the development of early computer languages and database management systems. It is best known for its work on COBOL and the CODASYL network data model, which was a precursor to modern relational databases.

A data base may be defined as a collection of interrelated data stored together with as little redundancy as possible to serve one or more applications in an optimal fashion; the data are stored so that they are independent of programs which use the data; a common and controlled approach is used in adding new data and in modifying and retrieving existing data within the data base.

Two types of languages are mentioned in connection with DBMS. The first is the Data Description Language (DDL), which describes the types of data entities which may exist along with the allowable attributes.

There may be two DDLs, or two levels of DDL, for describing a data base. The first level description is the system’s view of the data base as it is actually organized, and the second, a user’s view of the data base. These levels are called the schema and subschema respectively. In the relational model terminology, the DDL may be called the relational algebra.

The second language is the Data Manipulation Language (DML), which is concerned with the storage, retrieval and modification of specific occurrences of the entity types described by DDL statements. In relational model terminology, this language corresponds to the relational calculus. The entities handled by DDL and DML may be records, sets or anything that may need manipulation. The attributes may be such things as data items, set membership, set ownership or location within the data base.

The data base model is the meta-structure which is imposed on the organization of the data base. The model prescribes the types of entities which are allowed. It defines the data attributes and structural attributes that an entity may have.

The definition of a DDL and DML is the implementation of the meta-structures of a data base model. Currently the two most widely discussed models are the network model and the relational model.

COBOL data manipulation language (DML) is a programming language extension that provides a way for a COBOL application program to access a database. A COBOL database application program contains DML statements that tell the Database Control System (DBCS) what to do with specified data; the DBCS provides all database processing control at run time. The four classes of DML statements are data definition, control, retrieval, and update. An explanation of each class follows, together with important definitions of members of that class:

Data definition—These entries define the specific part of the database to be accessed by the application program and any keep lists needed to navigate through it. The entries also result in the creation of a database user work area (UWA). Transfer of data between your program and the database takes place in the UWA. Your program delivers data for the DBCS to this area; it is here that the DBCS places data requested from the database for retrieval to your program.

Terminology and Concepts.

For a complete description of the CODASYL schema DDL statements and DBMS design see Ref. 2. The schema DDL is used to describe a data base and has the following entity types: data items, data aggregates, records, areas and sets.

A data item is an occurrence of a named atomic data attribute. It is the smallest unit of named data.

The set of values that a data item can assume is called its range.

The range of an item is always restricted to values of a particular type. The possible types are arithmetic data, string data, data base keys and implementor defined types.

A data aggregate is an occurrence of a named collection of data items. There are two kinds: vectors and repeating groups. A vector is a one dimensional sequence of data items, all with identical characteristics. A repeating group is a collection of data attributes that occurs multiple times within a record occurrence. The collection of attributes may include data items and data aggregates.

A record is an occurrence of a named collection of zero or more data items or data aggregates. Each record entry defines a record type of which there may be zero or more occurrences within the data base.

The record is the smallest addressable entity within the data base. A set is a named collection of records. Each set entry in the schema defines a set type for which zero or more occurrences (sets) may exist in the data base.

Each set type declared in the schema must have one record type declared as its owner and may have one or more record types declared as its members. Each set occurrence which exists in the data base must contain exactly one record of its owner type and zero or more records of its member record types.
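
The owner and member rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (Record, SetType, SetOccurrence), the connect method and the dept_emp schema are hypothetical and are not part of the CODASYL DDL.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One record occurrence: its record type name and its data items."""
    record_type: str
    items: dict


@dataclass
class SetType:
    """Schema declaration: one owner record type, one or more member types."""
    name: str
    owner_type: str
    member_types: tuple


@dataclass
class SetOccurrence:
    """Exactly one owner record plus zero or more member records."""
    set_type: SetType
    owner: Record
    members: list = field(default_factory=list)

    def __post_init__(self):
        if self.owner.record_type != self.set_type.owner_type:
            raise ValueError("owner record is not of the declared owner type")

    def connect(self, record):
        # Only records of a declared member type may join the set.
        if record.record_type not in self.set_type.member_types:
            raise ValueError(f"{record.record_type} is not a member type")
        self.members.append(record)


# Hypothetical schema: each department owns a set of employees.
dept_emp = SetType("dept_emp", owner_type="department", member_types=("employee",))
sales = SetOccurrence(dept_emp, Record("department", {"name": "Sales"}))
sales.connect(Record("employee", {"name": "Ada"}))
```

A set occurrence with an owner and no members is still valid here, which matches the "zero or more" wording of the rule.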

An area is a named collection of records which need not preserve owner/member relations. An area may contain occurrences of multiple record types and a record type may occur in multiple areas. A particular record occurrence is assigned to an area when it is created and it may not migrate out of that area. An area may be declared to be temporary. Temporary areas are created especially for a run-unit, exist for the life of the run-unit and are destroyed when the process terminates.

Data Base Keys. The DDL assumes that every record occurrence in the data base has a unique identifier which enables the DBMS to distinguish it from every other record in the data base. This key must be assigned when the record is created and remains with it for the life of the record. This key may be supplied to the DBMS by a run-unit or data base procedure, generated from the record’s contents or assigned by the DBMS.
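
As a rough illustration of the last option (a key assigned by the DBMS), the following Python sketch hands out a key when a record is stored and never changes it afterwards. RecordStore and its methods are hypothetical names chosen for the example, and a simple counter stands in for whatever key scheme a real implementation would use.

```python
import itertools


class RecordStore:
    """Toy store in which the DBMS assigns each record's data base key."""

    def __init__(self):
        self._keys = itertools.count(1)
        self._records = {}

    def store(self, record_type, items):
        db_key = next(self._keys)  # assigned once, when the record is created
        self._records[db_key] = {"type": record_type, "items": dict(items)}
        return db_key

    def modify(self, db_key, **changes):
        # The contents change; the key stays with the record for its life.
        self._records[db_key]["items"].update(changes)

    def get(self, db_key):
        return self._records[db_key]


db = RecordStore()
k1 = db.store("student", {"name": "Ann"})
k2 = db.store("student", {"name": "Ann"})  # same contents, different record
db.modify(k1, name="Ann Lee")
```

Two records with identical contents still receive different keys, which is what lets the DBMS tell them apart.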

Network Model

The popularity of the network data model coincided with the popularity of the hierarchical data model. Some data were more naturally modeled with more than one parent per child. So, the network model permitted the modeling of many-to-many relationships in data. In 1971, the Conference on Data Systems Languages (CODASYL) formally defined the network model. The basic data modeling construct in the network model is the set construct. A set consists of an owner record type, a set name, and a member record type. A member record type can have that role in more than one set, hence the multiparent concept is supported. An owner record type can also be a member or owner in another set. The data model is a simple network, and link and intersection record types (called junction records by IDMS) may exist, as well as sets between them. Thus, the complete network of relationships is represented by several pairwise sets; in each set one record type is owner (at the tail of the network arrow) and one or more record types are members (at the head of the relationship arrow). Usually, a set defines a 1:M relationship, although 1:1 is permitted. The CODASYL network model is based on mathematical set theory.
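
To make the role of link records concrete, here is a small Python sketch in which a many-to-many relationship between students and courses is built from two 1:M sets that share one member record type. The names (enrollment, student_enrollment, course_enrollment, enroll) are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Two set types share the member (link) record type "enrollment":
#   student_enrollment: owner student, member enrollment
#   course_enrollment:  owner course,  member enrollment
student_enrollment = defaultdict(list)  # owner student -> member link records
course_enrollment = defaultdict(list)   # owner course  -> member link records


def enroll(student, course):
    """Create one link record and connect it into both sets."""
    link = {"student": student, "course": course}
    student_enrollment[student].append(link)
    course_enrollment[course].append(link)
    return link


def courses_of(student):
    return [link["course"] for link in student_enrollment[student]]


def students_in(course):
    return [link["student"] for link in course_enrollment[course]]


enroll("Ann", "Databases")
enroll("Ann", "Compilers")
enroll("Bob", "Databases")
```

Each link record has exactly one owner in each set, so every individual set stays 1:M even though the overall relationship is many-to-many.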

Hierarchical Model

The hierarchical data model organizes data in a tree structure. There is a hierarchy of parent and child data segments. This structure implies that a record can have repeating information, generally in the child data segments. Data is held in a series of records, each of which has a set of field values attached to it. The model collects all the instances of a specific record together as a record type. These record types are the equivalent of tables in the relational model, with the individual records being the equivalent of rows. To create links between these record types, the hierarchical model uses parent-child relationships, which are 1:N mappings between record types. The mappings are represented as trees, a structure “borrowed” from mathematics in the same way that the relational model borrows set theory.

For example, an organization might store information about an employee, such as name, employee number, department, salary. The organization might also store information about an employee’s children, such as name and date of birth. The employee and children data forms a hierarchy, where the employee data represents the parent segment and the children data represents the child segment. If an employee has three children, then there would be three child segments associated with one employee segment. In a hierarchical database the parent-child relationship is one to many. This restricts a child segment to having only one parent segment. Hierarchical DBMSs were popular from the late 1960s, with the introduction of IBM’s Information Management System (IMS) DBMS, through the 1970s.
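
The employee example can be sketched as a small tree in Python. The Segment class and its field names are assumptions made for illustration; they do not reflect how IMS or any other product stores segments.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Segment:
    """A segment occurrence with at most one parent and any number of children."""
    segment_type: str
    fields: dict
    parent: Optional["Segment"] = None
    children: list = field(default_factory=list)

    def add_child(self, child):
        # The one-to-many rule: a child segment can have only one parent.
        if child.parent is not None:
            raise ValueError("a child segment can have only one parent")
        child.parent = self
        self.children.append(child)
        return child


employee = Segment("employee", {"name": "J. Smith", "number": 1042})
for name in ("Ana", "Ben", "Cy"):
    employee.add_child(Segment("child", {"name": name}))
```

Trying to attach one of these child segments to a second employee raises an error, which is the restriction that the network model later relaxed.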

NETWORK DATA MODEL

The Conference on Data Systems Languages (CODASYL), an organization comprising vendor representatives and user groups, developed the language COBOL. In the late 1960s, CODASYL appointed a subgroup known as the Database Task Group (DBTG) to develop standards for database systems. DBTG published a preliminary report in 1969. Based on revisions and suggestions made for improvement, DBTG published a revised version of the report in 1971.

Essentially, the network data model is based on the 1971 DBTG report. This data model conforms to a three-level database architecture: conceptual, external, and internal levels. A number of commercial database systems were developed to implement the network data model.

Summary of Basic Concepts

• Data is organized in the form of records being arranged as a network of nodes.

• Two fundamental modeling concepts make up the network data model: record types and set types.

• Two record types are linked as a set. The set expresses the one-to-one or one-to-many relationship between two record types.

• A set expressing the relationship between two record types consists of a member record type and an owner record type.

• One owner record type may be part of different sets with different member record types.

• Similarly, one member record type may have multiple owner record types.

• A network consisting of one-to-one or one-to-many relationships is known as a simple network. A complex network, on the other hand, also contains many-to-many relationships.

• Each record type generally represents an entity type of the organization. Data fields in the record types denote the attributes of the entity type.

• An instance of a record type represents one occurrence of the entity represented by the record type.

• Logical links between related records are implemented through physical addresses (pointers) in the record itself.
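
The last point, links held as pointers inside the records, can be sketched as a ring in Python. This is one possible layout and not the only one: the Node class, the single next pointer and the connect helper are assumptions for the example, with object references standing in for physical addresses.

```python
class Node:
    """A stored record holding one pointer for one set type."""

    def __init__(self, name):
        self.name = name
        self.next = self  # an owner with no members points back to itself


def connect(owner, member):
    """Link member in as the last record of the owner's ring."""
    last = owner
    while last.next is not owner:
        last = last.next
    last.next = member
    member.next = owner  # the last member points back to the owner


def members(owner):
    """Follow the pointers from the owner until the ring closes."""
    node = owner.next
    while node is not owner:
        yield node.name
        node = node.next


sales = Node("Sales")
connect(sales, Node("Ada"))
connect(sales, Node("Bob"))
```

Listing the members of a set is then a walk along the pointers, which is why navigation in this model proceeds one record at a time.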

In order to specify the relationship between DDL declarations and DML functions a set of basic data manipulation functions must be defined which is DML and host language independent. Specific commands provided by a particular DML must be resolved into those basic functions. The resolution is defined by the implementor of the DML.

Database processing in the COBOL programming language. Data-base applications written in a host programming language are often associated with both large data bases which contain 10^8 characters or more and a well-known set of applications or transactions, perhaps run hundreds of times a day, triggered from individual terminals. Such extensive processing must be efficient, and the designers of the DBTG system took care that its applications could be tuned to ensure efficiency. Although the DBTG recognized the importance of supporting other language interfaces for a data base, especially “self-contained” languages for unanticipated queries, it did not directly address the problem of other interfaces.

Automated help is available in the recordkeeping phase through the use of data dictionary software to catalog various characteristics of the attributes: name, length, type, who generates it, who uses it, etc. Once the relevant attributes are identified, the data-base administrator has the problem of grouping attributes together into proper entities. Some possible guidelines for doing this are:

(1) Determine those attributes (or concatenations of attributes), occurrences of which identify the entities being modeled. Call these attributes identifiers or candidate keys. For example, if students are the entity being modeled and each student has a unique student number as well as a social security number, then both student number and social security number are identifiers. Group together all identifiers for a particular entity.

(2) Determine those other attributes of an entity that describe it, where there is only one value of the attribute for a given entity and the attribute is not part of an identifier. Consider grouping these non-identifier attributes with the identifier or identifiers of the entity. For example, if students have a name and are admitted from a given high school, the items name and high school would be grouped with the student number and social security number.

(3) If, for an identifier, there may be several values of an attribute, consider whether this “repeating item” may be better modeled as part of a separate entity. For example, if a student is enrolled in several courses, consider whether courses are not themselves separate entities worthy of being modeled. If they are, then see guideline 4 below. If not (for example, if educational degrees are considered to describe a student but are not entities in themselves), then either allocate a separate attribute for each of the finite number of repetitions (degree 1, degree 2, etc.), or associate a dependent, repeating structure with the entity.

The data-base description, written in the data description language, is normally compiled into internal tables of the DBMS.

Before this is possible, however, other design decisions must be made. Information structure design deals with entities and attributes. In contrast, data-base management systems manage records which are organized as indexed sequential files, hashed or direct access files, inversions, ring structures, or other structures [G1]. Thus it is necessary to reduce the entity and attribute level of data-base design to the world of computers, e.g., to choose among storage allocation strategies; to equate entities with records, perhaps representing associations with pointer structures; to decide whether some attributes should be indexed; and to choose between one-way and two-way lists. A DBMS offers a variety of options during the data definition stage, and the data-base administrator must choose a reasonable (if not optimal) alternative. Further, if any validity, integrity, or privacy constraints are to be enforced, these must also be stated.

A description of the Schema DDL consists of four major sections:

• an introductory clause

• one or more AREA clauses

• one or more RECORD clauses

• one or more SET clauses.

The introductory clause is used to name the data base and to state certain global security and integrity constraints. An area is a logical subdivision of the data base, which in many implementations corresponds to a file or data set in an operating system. While we usually think of a data base as being a single integrated collection of data, it is often desirable to subdivide such a data base into multiple logical subunits, in order to implement special security and integrity constraints and to provide a mechanism to control the performance and cost of implementation. Data-base security can be increased by placing highly sensitive data in logically separate areas and by placing special controls over those areas. Of course, physical separation of the areas may be used to increase security.

Data-base integrity can be improved by placing critical data in areas that are safe from harm or are often duplicated, while high performance areas may need to reside on high-speed devices. Similarly, costs can be minimized by placing infrequently used data in areas which reside on less costly devices.

Any logical or physical reason for splitting the data can utilize the area concept. The area description in the Schema DDL allows the data-base administrator to name these subdivisions of the data base and to specify which of the areas contain which record types. The actual mapping of areas to one or more physical storage volumes is under the control of a separate device media control language (DMCL).

For every record type in a data base there exists a description in the Schema DDL. A schema record description consists of information about the record type, such as its storage and location mechanism, and information about the area or areas in which occurrences of the record type may be placed. The record description contains a description of all data items that constitute the record type. A record occurrence in the stored data base consists of occurrences of each data-item type that constitute the record type. These record occurrences are the units of data transfer between the stored data base and an application program.

Thus, the application programmer interface uses a “record at a time” logic (one record occurrence is delivered or stored for each command) in accessing the data base. For each set type in a data base, a separate set description is written in the Schema DDL. Each set description names the set type, specifies the owner-record type and member-record type or types, and states detailed information on how occurrences of the set are to be ordered and selected. The introductory section of a schema description consists of a statement naming the schema, and certain security and integrity constraints. For our sample data base, this introductory section is:

specification of a language to describe the structure and contents of a data base. This description is called a schema. The schema language represents one of several languages which data base designers, implementors and users will employ. Other languages include current procedural programming languages, for example, COBOL and FORTRAN, data manipulation languages, device media control languages, and languages to control the execution of work (data processing) on a computer system. The current procedural programming languages must contain the following elements to be used with a schema language controlled data base:

• A subschema language to describe a subset of the schema which is of interest to a particular application program. A subschema enables an application program to deal with a subset of the data in the data base. The subschema may also vary in certain respects from the schema with respect to particular elements in the data base.

• A data manipulation language (DML) used at execution time to handle all program interfaces to the data base.

The basic data manipulation functions assumed in these specifications include the functions required to:

• Select records.

• Present records to the run unit.

• Add new records and relationships.

• Change existing records and relationships.

• Remove existing records and relationships.

DATA BASE PROCEDURES

At various points in the accessing of a data base, computations are required which are specific to that particular data base. Some examples of these computations are:

• Checking of privacy keys for validity.

• Computation of data item values as functions of other data item values.

• Searching algorithms.

• Compression and expansion of values of data items.

• Validation of values of data items.

• Systems instrumentation.

The routines which perform these computations are called data base procedures. They are stored in the system where they can be invoked by the DBMS when they are needed. The rules for writing data base procedures (that is, linkage conventions, allowable side effects, programming languages in which they are written, etc.) are implementor defined.
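
Since the rules for data base procedures are implementor defined, the following Python sketch only shows the general idea for one of the listed uses, validation of data item values. The registry, the data_base_procedure decorator, the store function and the credits rule are all hypothetical.

```python
procedures = {}  # (record type, data item) -> validation procedure


def data_base_procedure(record_type, item):
    """Register a validation procedure for one data item (hypothetical API)."""
    def register(func):
        procedures[(record_type, item)] = func
        return func
    return register


@data_base_procedure("course", "credits")
def credits_in_range(value):
    return 1 <= value <= 6


def store(record_type, items):
    """Stand-in for the DBMS: run registered procedures before storing."""
    for item, value in items.items():
        check = procedures.get((record_type, item))
        if check is not None and not check(value):
            raise ValueError(f"invalid value for {item}: {value!r}")
    return dict(items)
```

The application never calls credits_in_range itself; the stand-in DBMS invokes it when a course record is stored, which mirrors the way data base procedures are invoked by the DBMS when needed.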

3.0.7 DATABASE-DATA-NAMES

A data-base-data-name is a user defined word that names a data item or data aggregate. When used in a general format, ‘data-base-data-name’ may not be subscripted or qualified unless specifically permitted by the rules for that format. The named data item or data aggregate need not be the subject of a Data Subentry.

3.0.8 DATABASE-IDENTIFIERS

A data-base-identifier is a reference to a data item or data aggregate declared in the schema. It consists of a data-base-data-name followed, as required, by the syntactically correct combination of subscripts and qualifiers necessary to achieve uniqueness of reference.

Data Manipulation Language

MAPPING CODASYL-DML FIND STATEMENTS

The FIND statement is logically required before each of the major CODASYL DML statements, except for the STORE statement.

When a user issues a FIND command, a record is found, and it is placed in the currency indicator table (CIT).

The format of the FIND statement is:

FIND record_selection_expression [ ]

while the general format of the ABDL RETRIEVE statement is:

RETRIEVE Query Target-list [by attributes]

Each of the preceding formats is presented using the following conventions: upper-case notation represents literals, lower-case represents user-supplied variable names, and square brackets contain optional clauses.

The FIND statement has several variants, and we will, in turn, present each of these.

The FIND ANY Statement

The FIND ANY statement locates a record of a specified type whose values for the specified data items are equal to those in that record’s template in the user work area (UWA).

The syntax of the statement is:

FIND ANY record_type_x USING item_1, …, item_n IN record_type_x

KMS, in mapping the FIND ANY statement, must use the ABDL RETRIEVE statement and form a query whose first predicate is (FILE = record_type_x).

KMS then forms the additional predicates by locating the values of the relevant data items in the record template. The request is then executed, with the results being placed in the result buffer (RB). Following the request execution, KMS creates the target list consisting of the requested record’s attributes. Thus, the ABDL translation of the CODASYL-DML statement is:

RETRIEVE ((FILE = record_type_x) AND (item_1 = value_1) AND … AND (item_n = value_n)) (all attributes) [BY record_type_x]

The requirement is to find any course record whose title is ‘Advanced Database’.

The CODASYL-DML procedure is:

MOVE ‘Advanced Database’ TO title IN course
FIND ANY course USING title IN course

It should be noted that the MOVE statement is an assignment statement found in the host COBOL language and in the above transaction it serves to initialize the UWA field title in course.

KMS would make the following translation and actions:

(1) ‘Advanced Database’ is placed in the course template of the UWA for the attribute title.

(2) A RETRIEVE request is formed: RETRIEVE ((FILE = course) AND (title = ‘Advanced Database’)) (title, dept, semester, credits) BY course

(3) The request is passed to KC for execution. The result is that the course records satisfying the search criteria are placed in RB.
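
The translation steps for FIND ANY can be sketched as a small Python function that builds the request text from the UWA template. This is a minimal sketch, not the KMS implementation: the function name find_any_to_retrieve, the dictionary layout of the UWA and the use of plain single quotes around values are assumptions made for the example.

```python
def find_any_to_retrieve(record_type, items, uwa):
    """Build an ABDL-style RETRIEVE string for FIND ANY ... USING items."""
    template = uwa[record_type]
    # The first predicate always names the file (the record type).
    predicates = [f"(FILE = {record_type})"]
    # One further predicate per USING item, with values read from the UWA.
    predicates += [f"({item} = '{template[item]}')" for item in items]
    # The target list names every attribute of the record type.
    target_list = ", ".join(template)
    return f"RETRIEVE ({' AND '.join(predicates)}) ({target_list}) BY {record_type}"


# Hypothetical UWA with one template per record type.
uwa = {"course": {"title": None, "dept": None, "semester": None, "credits": None}}
uwa["course"]["title"] = "Advanced Database"  # the effect of the COBOL MOVE
request = find_any_to_retrieve("course", ["title"], uwa)
```

Here request holds the same query as step (2), apart from the style of the quotation marks.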

The FIND CURRENT Statement

The FIND CURRENT statement causes an update of CIT by changing the current of the run-unit from its present value to the value of the database key of the current record of a specified set type. The statement is of use when we want to begin a search at the current of a particular set, which requires that the current of the run-unit be updated to agree with it.

The syntax of the FIND CURRENT statement is:

FIND CURRENT record_type_x WITHIN set_type_y

The only function of this statement is to update CIT, and therefore it is a relatively simple task for KMS to handle, as there is no direct mapping to an ABDL statement. An example taken from the University database illustrates the use of the FIND CURRENT statement:

FIND CURRENT student WITHIN person_student

KMS would pass the CIT update information to KC for execution, where CIT is actually updated. The current of run-unit becomes the current student record occurrence of the current person_student set occurrence.
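
Because FIND CURRENT touches only the currency indicator table, it can be sketched as a single dictionary update in Python. The layout of the table below (a run_unit slot plus one entry per set type) and the database key values are assumptions for illustration; the source does not describe the actual CIT structure.

```python
# Hypothetical CIT layout: one slot for the run-unit, one entry per set type.
cit = {
    "run_unit": 9001,
    "sets": {
        "person_student": {"record_type": "student", "db_key": 4217},
    },
}


def find_current(cit, record_type, set_type):
    """FIND CURRENT record_type WITHIN set_type: only the CIT changes."""
    entry = cit["sets"][set_type]
    if entry["record_type"] != record_type:
        raise LookupError(f"current of {set_type} is not a {record_type} record")
    # The current of the run-unit now agrees with the current of the set.
    cit["run_unit"] = entry["db_key"]
    return cit["run_unit"]


find_current(cit, "student", "person_student")
```

No request is sent to the data base in this sketch, which reflects the statement having no direct ABDL counterpart.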

The FIND DUPLICATE WITHIN Statement

The FIND DUPLICATE WITHIN statement is used to sequentially access records within a particular set occurrence. A basic assumption is that the requested records have previously been located by another FIND and are therefore already resident in RB. The statement then locates the first record within the current set occurrence whose values for the listed items match those of the current record of the set.

The syntax of the FIND DUPLICATE WITHIN is:

FIND DUPLICATE WITHIN set_type_x USING item_1, …, item_n IN record_type_y

The translation actions are as listed below:

(1) KMS forwards set_type_x, record_type_y, and item_1, …, item_n to KC.

(2) KC locates the relevant RB using the information from (1) above.

(3) Each record within RB is searched until the first duplicate record within the set is found.

(4) The record is made available to the user. Additionally, KC will update CIT following the accessing of each record presented to the user.
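
Steps (2) to (4) can be sketched in Python as a scan of the result buffer. Several details are assumptions made for the example: RB is modeled as a list of dictionaries, the CIT holds the position of the current record of the set within that list, and the scan starts just after the current record.

```python
def find_duplicate_within(rb, cit, set_type, items):
    """Return the next record in rb matching the current record on items."""
    start = cit[set_type]          # position in rb of the current of the set
    current = rb[start]
    for i in range(start + 1, len(rb)):
        if all(rb[i][item] == current[item] for item in items):
            cit[set_type] = i      # the record found becomes the current
            return rb[i]
    return None                    # no duplicate in this set occurrence


# Hypothetical result buffer left behind by an earlier FIND.
rb = [
    {"title": "Databases", "semester": "fall"},
    {"title": "Compilers", "semester": "spring"},
    {"title": "Networks", "semester": "fall"},
]
cit = {"dept_course": 0}
match = find_duplicate_within(rb, cit, "dept_course", ["semester"])
```

Repeating the call continues from the record just found, which gives the sequential access the statement is meant to provide.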

The FIND FIRST/LAST/NEXT/PRIOR Statements

This subsection presents several related variants of the FIND statement; they identify a record by its position in a set. For instance, the FIND FIRST statement locates the first record of a set occurrence, the FIND LAST statement locates the last record of a set occurrence, and so on. Each of these statements is mapped in the same manner, and therefore we will focus the translation explanation on the FIND FIRST statement. The syntax for the FIND FIRST statement is:

FIND FIRST record_type_x WITHIN set_type_y

First of all, KMS ensures that the specified record type is a member of the specified set occurrence. This is accomplished by checking the nsn_set_member_name field of the nset_node data structure of

Once the set membership is verified, KMS forms a RETRIEVE request that places every member record of the set occurrence into its RB. The request is satisfied by returning the first record. In the case of FIND NEXT and FIND PRIOR, the set occurrence must have previously been retrieved and placed into RB. KMS must simply check CIT and determine the current of the set and return either the next or the prior record. Recalling the two types of sets in the functional data model, ISA relationships and Daplex functions, we have devised two methods for accessing all members of a particular set occurrence.

The first method is for retrieving members of a set type reflecting an ISA relationship where the set name consists of the owner name, followed by “_”, followed by the member record name.

KMS generates the following ABDL request:

RETRIEVE ((FILE = record_type_x) AND (MEMBER, set_type_y = set_type_x.owner.dbkey)) (all attributes)
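
Once the member records of a set occurrence are in RB, the positional FIND variants described in this subsection reduce to moving a currency position through the buffer. The Python sketch below illustrates that idea only; the SetCursor class, its find method and the behavior at the ends of the set (returning None and leaving the currency unchanged) are assumptions, not the documented KMS behavior.

```python
class SetCursor:
    """Member records of one set occurrence plus the current of the set."""

    def __init__(self, rb):
        self.rb = list(rb)
        self.current = None  # position of the current record, if any

    def find(self, position):
        if not self.rb:
            return None
        if position == "FIRST":
            index = 0
        elif position == "LAST":
            index = len(self.rb) - 1
        elif position in ("NEXT", "PRIOR"):
            # NEXT and PRIOR are relative to a previously established current.
            if self.current is None:
                raise LookupError("no current of set; issue FIND FIRST or LAST")
            index = self.current + (1 if position == "NEXT" else -1)
        else:
            raise ValueError(f"unknown position: {position}")
        if not 0 <= index < len(self.rb):
            return None  # ran off either end of the set occurrence
        self.current = index
        return self.rb[index]


cursor = SetCursor(["record_1", "record_2", "record_3"])
first = cursor.find("FIRST")
second = cursor.find("NEXT")
```

FIND NEXT and FIND PRIOR fail in this sketch when no current has been established, matching the requirement that the set occurrence be retrieved beforehand.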
