After completing this section, you should be able to
A Comment on Normalization
Students often find normalization difficult as some of them do not have a sufficient mathematical background to help them to follow discussions in textbooks. The aim of the next few pages is to explain normalization and normal forms in simple words.
One phase in database development is mapping entity-relationship diagrams representing a business organization into relations. If this mapping is poor, then the resulting relational model will suffer from modification anomalies (drawbacks) when inserting or deleting data. Normalization is the process of applying rules (called normal forms) to the relational model to obtain a well-designed relational model. Theoretically, we have six rules called 1NF, 2NF, 3NF, BCNF, 4NF, and 5NF. But for practical purposes, putting relations in BCNF will result in a well-designed relational model, so we will discuss only the first four rules. These normal forms are interrelated in that before placing a relation in 2NF the relation should be in 1NF first.
To place a relation in 3NF, the relation should be in 1NF then 2NF. To place a relation in 4NF, then it should be in 1NF then 2NF, then 3NF, then BCNF, and so on.
Just for your information, there a single rule called DKNF (Domain Key Normal Form), which stands by itself. If we managed to put a relation in DKNF, then we guarantee that it is free from modification anomalies. But it is not known whether it is possible to put any relation in DKNF. Also, this rule is complex and more than we need to obtain a well designed relational model. So, to avoid confusion we will not discuss this rule here.
Now, lets return to normal forms. Since these are concerned with the relational model, we have to make sure that the tables we have really are relations, 1NF takes care of that:
A. 1NF: every relation is in 1NF. If the table we have is a relation then it is in 1NF.
Consider the following example:
A department offers courses for students who pay tuition fees. The following relation represents this situation.
DEPARTMENT (student-no, course-no, fee)
| Student-no | Course-no | Fee |
|---|---|---|
| S101 | C213 | 300 |
| S201 | C213 | 300 |
| S301 | C413 | 350 |
What will happen if S301decided to drop the course and s/he is the only student enrolled in course C413? If we deleted the third row (tuple), then we did not delete only that piece of information saying that S301 is enrolled in C413, but we also deleted that piece of information saying that course C413 is being offered by the department and it costs $350 (we deleted more information than we wanted to). This is an example of deleting anomalies.
Now consider another situation. If the department offers course C513 and no student is registered in the course, then we cannot enter that piece of information until a student enrolls in that course, because according to the definition of a relation we cannot insert a null value in an attribute. This is an example of inserting anomalies.
These anomalies occurred because as you see above the key is compound and the non-key attribute (i.e. fee) depends on only part of the key. The 2NF resolves this problem.
B. 2NF: a relation is in 2NF if
Obviously, in our example above, fee depends on course only (i.e. part of the key) and not on the whole compound key, so we split this relation into two relations, each in 2NF:
STUDENT (student-no, course-no), COURSE (course-no, fee)
| STUDENT | COURSE | ||
|---|---|---|---|
| Student-no | Course-no | Student-no | Fee |
| S101 | C213 | C213 | 300 |
| S201 | C213 | C413 | 350 |
| S301 | C413 |
Examine each of these two relations for the modification anomalies mentioned above and check yourself that each is in 2NF.
C. 3NF: a relation is in 3NF if
1. the relation is in 2NF
2. there is no transitive dependency
Consider the following situation:
Young children ride buses between home and school. Each student pays a rate according to bus type: basic (B), loaded (L), fully loaded (FL). We can represent this situation by the following relation:
TRANSPORTATION (student-no, bus-type, rate)
| Student-no | Bus-type | Rate |
|---|---|---|
| S101 | B | 100 |
| S201 | B | 100 |
| S301 | L | 120 |
Now, this table is a relation, so it is in 1NF by definition. The key is a single attribute (i.e. there is no compound key) so it is in 2NF. But still this table suffers from modification anomalies. If student S301 moved from bus type L to B or changed schools and he/she is the only student using that bus, then we will have to delete that row (tuple). In this case we also deleted more information than we have to, we deleted a piece of information saying that there is a bus of type L and it costs $120 (a deleting anomaly). Also since no student uses the bus of type FL we cannot insert that piece of information in the relation because we cannot enter a null value in the student-no attribute (an inserting anomaly). These anomalies occurred because of transitive dependency. Student-no determines bus-type, bus-type determines rate, hence, student-no determines rate (since rate is associated mainly with bus-type but the key is student-no). To put this relation in 3NF we split it into:
STUDENT (student-no, bus-type) , BUS (bus-type, rate)
| STUDENT | BUS | ||
|---|---|---|---|
| Student-no | Bus-type | Bus-type | Rate |
| S101 | B | B | 100 |
| S201 | B | L | 120 |
| S301 | L | FL | 150 |
Examine these two relations yourself and make sure that they are in 1NF, 2NF and that each has no transitive dependency, making them in 3NF.
D. BYNF: a relation is in BCNF if
1. the relation is in 3NF
2. every determinant is a candidate key.
Consider the following situation:
Students in a school are asked to join activities supervised and trained by coaches. A student may join more than one activity. A coach may train students in only one activity. An activity may have more than one coach (if many students choose that activity, then they are divided into groups, each group supervised by a coach). The following relation represents this situation.
ST-AC-CO (student-no, activity, coach)
| Student-no | Activity | Coach |
|---|---|---|
| S101 | A1 | C1 |
| S101 | A2 | C2 |
| S202 | A1 | C1 |
| S303 | A1 | C3 |
| S303 | A2 | C2 |
| S303 | A3 | C4 |
| S404 | A3 | C5 |
This relation is in 1NF by definition, in 2NF since it is in 1NF and every non-key attribute depends on the whole key, in 3NF since it is in 2NF and has no transitive dependency. But it still has modification anomalies. If S303 is not playing A1 any more and we deleted the 4th row (tuple), then we have also deleted the piece of information telling as that there is a coach C3 ready to train students who enroll in A1 (deleting anomaly). Also, if we have coach C6 who is ready is train A4 but no student is enrolled in A4, then we cannot insert that piece of information about C4 (inserting anomalies). The attribute COACH is a determinant since it determines activity (not the opposite). To put the above relation in BCNF we have to split it into:
ST-AC (student-no, activity), AC-CO (activity, coach)
| ST-AC | AC-CO | ||
|---|---|---|---|
| Student-no | Activity | Activity | >Coach |
| S101 | A1 | A1 | C1 |
| S101 | A2 | A2 | C2 |
| S202 | A1 | A1 | C3 |
| S303 | A1 | A3 | C4 |
| S303 | A2 | A3 | C5 |
| S303 | A3 | ||
| S404 | A3 | ||
| S404 | A2 |
Examine ST-AC and AC-CO for being in the 1NF, 2NF, 3NF and BCNF and being free of the modification anomalies that exist in ST-AC-CO. Can you check to see whether dividing ST-AC-CO into ST-AC and ST-CO will be an acceptable alternative to ST-AC and AC-CO?
A. 4NF: a relation is in 4NF if
1. it is in BYNF
2. has no multi-valued independent attributes (i.e. there are no multi-valued attributes, or if multi-valued attributes exist, then they are functionally dependent).
Consider the following situation: Students in schools are encouraged to have hobbies outside their studies. We want to record the courses and hobbies of students. A student may have more than one course and more than one hobby. This situation is represented by the relation ST-CO-HO-A.
Several points may be noticed about this relation:
1. redundancy: for student S101 who has two courses and one hobby, we repeated the value of his/her hobby because we cannot leave this second row with a null value in the hobby-name column.
2. deleting anomaly: if S210 dropped course C213, and we deleted the row representing that, then we deleted also that he is playing football.
3. inserting anomaly: if S401 is enrolled in course C213 but does not have a hobby, we cannot record this information because we cannot enter a null value in the hobby-name attribute.
4. false impressions: for S301, one may have the impression that C413 encourages mental activities since chess is with it in the same row, while C213 encourages physical activities as basketball and swimming. In fact, there is no relationship between courses and hobbies, so to remove such a false impression,
we have to repeat each course with each hobby to end
with the relation ST-CO-HO-B.
ST-CO-HO-A (student-no, course-no, hobby-name), ST-CO-HO-B (student-no, course-no, hobby-name)
| A | ||
|---|---|---|
| ST-CO-HO-A | ||
| Student-no | Course-no | Hobby-name |
| S101 | C213 | billiards |
| S101 | C413 | billiards |
| S201 | C213 | football |
| S301 | C413 | chess |
| S301 | C213 | basketball |
| S301 | C213 | swimming |
| S301 | C213 | chess |
| S301 | C213 | basketball |
| B | ||
|---|---|---|
| ST-CO-HO-B | ||
| Student-no | Course-no | Hobby-name |
| S101 | C213 | billiards |
| S101 | C413 | billiards |
| S201 | C213 | football |
| S301 | C413 | chess |
| S301 | C413 | basketball |
| S301 | C413 | swimming |
| S301 | C213 | chess |
| S301 | C213 | swimming |
Look at the redundancy in ST-CO-HO-B. We still have the anomalies mentioned above in points 1,2,3. This redundancy occurred because the two attributes (course-no and hobby-name) may have multiple values for each student and there is no relationship between the values of the attributes. ST-CO-HO-B is in 1NF, 2NF, 3NF and BCNF, but still has anomalies. 4NF resolves these anomalies, when we split ST-CO-HO-B into:
STU-CO (student-no, course-no), STU-HO (student-no, hobby-name)
| STU-CO | STU-HO | ||
|---|---|---|---|
| Student-no | Course-no | Student-no | Hobby-name |
| S101 | C213 | S101 | billiards |
| S101 | C413 | S201 | football |
| S201 | C213 | S301 | chess |
| S301 | C413 | S301 | basketball |
| S301 | C213 | S301 | swimming |
Examine these two relations yourself to make sure that they are free from the modification anomalies existing in ST-CO-HO-B. STU-CO and STU-HO are both in 4NF.