Chapter 3 Section 4Relational Databases and Data Normalization |
Databases can be either single-user or multi-user applications. A collection of recipes on a personal computer is probably a single-user application, but a student record system on a main frame is a multi-user application. Each database may be either single or multi-user. We introduce them in this chapter.
In this section we will begin with an overview of database terminology and concepts, which will be followed by an examination of how data is organized and managed in relational databases using normalized tables.
In a simple index card database, such as a Rolodex file, each card (or record) contains the same spaces (called fields) for items of data, such as "Last Name" "First Name" "Phone #" "Company" etc. This type of data organization works well for simple tasks, such as organizing a list of phone numbers, but does not allow for the inter-relationships between different groups of data items that more complex data management applications require.
If we examine three records taken from a simple Rolodex file "card by card," the data might look like this:
| _______________________________________ | ||
|
Last Name: Smith Phone #: 843-5467 | First Name: Ralph Company: Roico |
|
| _______________________________________ | ||
| Last Name: Brown Phone #: 843-7890 | First Name: Robert Company: Bainbridge |
|
| _______________________________________ | ||
|
Last Name: Binder Phone #: 876-5252 | First Name: Mike Company: Albright |
|
| _______________________________________ | ||
Each "card" above represents one record, and each record consists of four data fields. By contrast, relational databases are based upon data relationships defined through the use of tables. Each table consists of rows and columns, with key relationships between certain data items in each table establishing links. In tabular form, where each row is a record, and each column a field, the same data would look like this:
| Last Name | First Name | Phone # | Company |
| Smith | Ralph | 843-5467 | Roico |
| Brown | Robert | 843-7890 | Bainbridge |
| Binder | Mike | 876-5252 | Albright |
When more complex databases are created using several tables such as these, the various tables represent different data entities, which exist within the overall view of the data. Entities within databases are simply combinations of certain elements of the data that we wish to examine in terms of both their contents and their relationship to each other. To examine the implications of complex databases, we will use the example of a college's student information system, designed to record and report upon students, courses, and the relationships between the two.
In the following sample database application, there is a table containing information about students. In that table, there is a record for a student (an entity) called John Murdock. Each column (field) in the table represents an attribute of the entity, that is, some data item that describes the entity. For example, "student name" is an attribute of a student.
One or more attributes make up a unique 'key' for the record (also known as an instance or, in terms of the table, a row). If the key is a combination of two or more attributes, it is called a combined or concatenated key. The components of a concatenated key are generally keys from other tables, and each part of the concatenated key also serves as a link to the original table. The components of a concatenated key from other tables are referred to as foreign keys. The use of concatenated keys in normalized data tables is explored in the example at the end of this section.
In the example of the college information system, John Murdock's student identity (ID) number is a unique value that is used on the computer system to identify him as a student. Later we will see how the student ID number key is also used as part of a concatenated key.
Non-key data items only have meaning when associated with their key values. For example, John's record on the student table might contain a phone number. Taken by itself, without the ID number that connects it to a name, the phone number is a meaningless piece of data.
In the student/course database, multiple tables are used to manage the information pertaining to student enrollment in courses. For example, one table contains data about students. A second table will contain data about courses available, while a third table represents the relationship between the other two entities, resulting in a third entity.
The diagram below illustrates the nature of the relationships among these tables. A single point on the connecting arrow denotes a single instance of the entity and a double point represents a multiple instance. What this means in this example is that a single student could be enrolled in many courses, but that each enrollment instance is associated with a single student. Likewise, on the other side of the diagram, a single course can have many enrollments, but each enrollment is only associated with a single course.
TABLE 1 (STUDENTS) <---->>TABLE 2 (ENROLLMENTS) <<----> TABLE 2 (COURSES)
To describe this another way:
A student instance can be related to (or linked with) many enrollment instances.
An enrollment instance must relate to one and only one student instance.
A course instance can be related to (or linked with) many enrollment instances.
An enrollment instance must relate to one and only one course instance.
We would describe the relationship of student instance to enrollment instance in the example above as a one-to-many relationship. In this definition, we have also assumed that a student instance can exist without an enrollment instance. That is, there can be a student who has not yet enrolled in any courses. We have also assumed a course can exist without anyone enrolled in it.
The essential function of such a system is called a query, which is basically a question the user asks the database.
With a system such as this you can obtain answers to many different queries. Some may involve only one table, e.g., "Who is registered in the course COMP?" This would provide student names only. More complex queries involving extracting data from more than one table are also possible, as shown in the following examples.
Who is enrolled in course # 6543provide student names and numbers
These types of queries are generally written in a fourth-generation language such as SQL (Structured Query Language). This language allows you to use an English-like syntax for expressing the query, rather than having to use the mathematical operators that form the functional basis of relational systems. The following is an example of SQL.
The expression "SELECT STUDENT_NAME, STUDENT_ADDRESS, PROGRAMME_REGISTERED" identifies the fields or data items to include in the response. "FROM STUDENT" tells the system what table to look at, and "WHERE PROGRAMME_REGISTERED= COMP" places a filter on the search, excluding from the report any data from records of students registered in programmes other than "COMP."
Some graphical systems generate SQL for the user by providing a point and click interface to build the query. Graphical languages and SQL are non-procedural programming, which means they involve telling the computer what is wanted, rather than how to do the query, as a procedural programming language like COBOL would require. Fourth-generation languages and graphical interfaces provide tools that can be used by trained users or programmers, depending on the complexity of the problem and skill of the user. Both fourth-generation languages such as SQL and graphical database interfaces make querying the database much easier than using a procedural programming language to perform the same queries.
For relational databases to work properly with multiple tables, the tables must be designed in such a way as to avoid the use of duplicate non-key items, and also to ensure that all non-key items are fully associated with the appropriate key(s). This process is called data normalization and is usually done in stages, usually stopping after three. These stages of normalization are known as 1NF (First Normal Form), 2NF (Second Normal Form) and 3NF (Third Normal Form).
The designer of the database is responsible for ensuring the data is correctly normalized. Only then can the powerful mathematical operators underlying the relational query languages be used with certainty.
We will go through some steps and tests to illustrate the process of normalization to the Third Normal Form (3NF). Students interested in additional tests to normalize relational systems beyond this level may wish to take systems analysis and database courses.
Starting with an un-normalized student record, we would see the following fields: