Vous êtes sur la page 1sur 15

Database Normalization Basics

ve you reached the point of having seen the term "normalization" used database literature, but you are
still unsure as to just what you have to do to get a normalized database? If so, then youre not alone.
We will be showing, in this article, how Normal Forms keep data integrity. Some people will argue that
it is all right to "denormalize" a schema for efficiency reasons. If we denormalize the schema, then we
are either not enforcing some of the user-required constraints or we have to write addition code to
enforce the constraint. If we choose to write additional code, we have to duplicate the functionality in
every application, present and future. Normalization reduces programming effort; rules are enforced in
one place, one way, one time.
Lets start in a simple way with some general heuristics about normalization.

Heuristics are exact and often sound a little funny when you say them. One of my favorites is if you cannot
think of it as a percentage, think of it as a ratio! (From a mathematical view point, a ratio and percentage
are the same. But percentages have a base while the two things in a ratio are seen as equal in some
sense).

Mother Celko's Thirteen Normalization Heuristics

Does the table model either a set of one and only one kind of entity or one and only one
relationship. This is what I call disallowing a Automobiles, Squids and Lady GaGa table.
Following this rule will prevent Multi-valued dependencies (MVD) problems and it is the basis
for the other heuristics.

Does the entity table make sense? Can you express the idea of the table in a simple collective
or plural noun? To be is to be something in particular; to be everything in general or nothing in
particular is to be nothing at all (this is known as the Law of Identity in Greek logic). This is why
EAV does not work it is everything and anything.

Do you have all the attributes that describe the thing in the table? In each row? The most
important leg on a three-legged stool is the leg that is missing.

Are all the columns scalar? Or is a column serving more than one purpose? Did you actually
put hat size and shoe size in one column? Or store a CSV list in it?

Do not store computed values, such as (unit_price * order_qty). You can compute these things
in VIEWs or computed columns.

Does the relationship table make sense? Can you express the idea of the table in a simple
sentence, or even better, a name for the relationship? The relationship is marriage and not
person_person_legal_thing

Did you check to see if the relationship is 1:1, 1:m or n:m? Does the relationship have
attributes of its own? A marriage has a date and a license number that does not belong to
either of the people involved. This is why we don't mind tables that model 1:1 relationships.

Does the entity or relationship have a natural key? If it does, then you absolutely have to
model it as the PRIMARY KEY or a UNIQUE constraint. Is there a standard industry identifier
for it? Let someone else do all that work for you.

If you have a lot of NULL-able columns, the table is probably not normalized.

The NULLs could be non-related entities or relationships.

Do the NULLs have one and only one meaning in each column?

If you have to change more than one row to update, insert or delete a simple fact, then the
table is not normalized.

Did you confuse attributes, entities and values? Would you split the Personnel table into
Male_Personnel and Female_Personnel by splitting out the sex code? No, sex is an
attribute and not an entity. Would you have a column for each shoe size? No, a shoe size is a
value and not an attribute.

The Writing on the Wall


Many years ago, there was a magazine named Database Programming & Design which published a
poster on Normalization by Marc Rettig as a subscription renewal premium. It was so popular that it is
still around after all this time. It follows one example, the Daisy Hill Puppy Farm database (this is the
birth place of Snoopy in the Peanuts comic strip).
You really ought to have a copy hanging in your cube. It is that good. It is called '5 Rules of Database
Normalization'.

You

can

now

find

it

here:

http://www.databaseanswers.org/downloads/Marc_Rettig_5_Rules_of_Normalization_Poster.pdf

The Relational Model


The Relational Model and Normal Forms of the Relational Model were first defined by Dr. E. F. Codd
then extended by other writers after him. He borrowed the term "normalized relations" from the US and
Soviet political jargon back when the USSR still existed.
Relations are a branch of mathematics which deal with mappings among sets defined by predicate
calculus from formal logic. Just like an algebraic equation there are many forms of the same relational
statement but the "normal forms" of relations are certain formally defined desirable constructions. The
goal of Normal Forms is to avoid certain data anomalies that can occur in un-normalized tables. Data
anomalies are easier to explain with examples. When Dr. Codd defined the relational model it added
some additional requirements to the visualization of the relation as a table:
1.

If a row in a table can exist without a certain attribute being known, then this is considered a
NULL state for that attribute and the DBMS must use a special symbol for that state. We have
that in SQL, but NULLs have a data type to signal the SQL engine how to handle the physical
storage.

2.

No two rows are identical. Furthermore it must be possible to define a subset of the known
attributes of a relation which have no identical values. This subset of attributes is known as a
"key". SQL was built on old file systems, and the language can allow redundant duplicates. But
that is a sign of really bad programming

3.

One key in a relation is the "primary key". This key cannot have any NULL attributes, of
course. This initial definition was later modified to regard all keys as equally key-like, but the
notion was set in the early literature and it got into SQL. The actual motive was from sequential
files; a sequential tape file has to be sorted on a sort key to work. ISAM files on disk also had

to have a sort key. This is why T-SQL has clustered indexes and and clustering as the default
for the PRIMARY KEY. The new technology mimics to old technology, until it learns better and
has its own voice.
4.

The order of rows and columns are irrelevant and do not contain any information. This leads us
to the requirement that any atomic fact in the database (a collection of such tables) can be
found

by

table

name,

column

names,

and

primary

key

value.

Again, SQL assumes that it can go to the ANSI/ISO Standard Schema Information meta-data
tables and get a default list of columns to fill in the column lists in various statements. It is a
shorthand.
Anyone tackling relational databases who is familiar only with file-based data will assumes
that there is a physical ordering because that is how all of the flies he has ever worked with are
stored -- physically contiguous tracks on a disk. Nope. Wrong. Columnar databases are not
contiguous storage. Neither is a hashed database. The first thing you have to do to learn SQL
is throw out a physical model mindset and think in abstractions.
Dr. Codd added quite a few other requirements (initially a total of 12 rules and eventually 333) but those
listed here will suffice for our discussion.

Anomalies
If you do not normalize a table, you get data anomalies. These are situations where data integrity
cannot be maintained without extra code, NULLs and convoluted programming. Let's create a simple
example. The table holds information about college students, their majors and their departmental
advisers.
CREATE TABLE StudentMajors
(student_name CHAR(10) NOT NULL,
major CHAR(10) NOT NULL,
adviser_name CHAR(10) NOT NULL,
PRIMARY KEY (student_name, major, adviser_name));

INSERT INTO StudentMajors


VALUES ('Higgins', 'Comp Sci', 'Celko'),
('Jones', 'Comp Sci', 'Celko'),
('Smith', 'English', 'Wilson');

UPDATE Anomaly
Let's try to update Higgins to change from Computer Science to Sanskrit and suddenly Celko is
advising Sanskrit!

UPDATE StudentMajors
SET major = 'Sanskrit'
WHERE student_name = 'Higgins';

The actual erroneous row is ('Higgins', 'Sanskrit', 'Celko')

DELETE Anomaly
Mr. Smith drops out of school and Professor Wilson loses his job!!
DELETE FROM StudentMajors
WHERE student_name = 'Smith';

The row ('Smith', 'English', 'Wilson') happens to be the only row that tells us that Professor
Wilson is the English department's faculty adviser. If this sample table and the English department had
been bigger, this problem could have been hidden for a long time. But we would have been carrying
redundant data.

INSERT Anomaly
Mr. F. L. Wright starts an Architecture department, but we cannot put him into the database without
using NULLs or a dummy student name.
INSERT INTO StudentMajors (student_name, major, adviser_name)--- fails!
VALUES (NULL, 'Architecture', 'Wright')'

SQL will prevent this row from getting into the table the ways I declared this example. It is bad practice
to declare an IDENTITY as the key and all other columns are NULL-able. Since the table's IDENTITY
property is not a column nor an attribute of any possible data model, it cannot ever be used as a key. It
is the count of the physical insertion attempts. You can use it to mimic the record numbers that are so
familiar from using sequential files, but they have no place in the relational model.

First, Second, Third, and Boyce-Codd Normal Forms


Let's talk about going from a file system to an RDBMS. Let's start with a file to maintain data about
class schedules at an imaginary school. We keep the course_nbr, section_nbr, time_period,
room_nbr,

professor_name,

student_name,

student_major, and course_grade.

Assume we started with a Pascal file, declared like this, and wanted to move it to a relational database.
I picked Pascal because it is easy to read and understand, unlike currently popular languages.
RECORD Classes =
course_nbr: ARRAY [1:7] OF CHAR;
section: CHAR;
time_period: INTEGER;

room_nbr: INTEGER;
room_size: INTEGER;
professor_name: ARRAY [1:25] OF CHAR;
Students : ARRAY [1:class_size]
OF RECORD
student_name ARRAY [1:25] OF CHAR;
student_major ARRAY [1:10] OF CHAR;
course_grade CHAR;
END;
END;

To convert this sequential file of records into a table, we first have to "flatten" it out, making sure that
each row has full information for all the columns. Since Pascal, or any other language, does not have
NULLs, the columns will be NOT NULL. The first thought is to simply rewrite every field over as a
column, a line-for-line translation, as I said earlier. It does not really work because a column is nothing
like a field; did you notice that Students is itself an array of records whose size is determined by an
external variable? This would have been done with an OCCURS clause in COBOL and other
programming language have similar features.
But let's do anyway.
CREATE TABLE Classes
(course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
time_period INTEGER NOT NULL,
room_nbr INTEGER NOT NULL,
room_size INTEGER NOT NULL,
professor_name VARCHAR(25) NOT NULL,
student_name VARCHAR(25) NOT NULL,
student_major VARCHAR(10) NOT NULL,
PRIMARY KEY (course_nbr, section_nbr, student_name));

We declare PRIMARY KEY (course_nbr, section_nbr, student_name) so we have a proper


table. But what we are doing is hiding the Students record array, which has not changed its nature by
being flattened.
To get this table in 1NF we "decompose" it into two or more tables so that no table has repeating
groups, hidden or otherwise.
CREATE TABLE Classes
(course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
time_period INTEGER NOT NULL,
room_nbr INTEGER NOT NULL,
room_size INTEGER NOT NULL,
professor_name VARCHAR(25) NOT NULL,
PRIMARY KEY (course_nbr, section_nbr));

CREATE TABLE StudentCourses


(student_name VARCHAR(25) NOT NULL,
course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
student_major VARCHAR(10) NOT NULL,
course_grade CHAR(1) NOT NULL,
PRIMARY KEY (student_name, course_nbr, section_nbr),
FOREIGN KEY (course_nbr, section_nbr)
REFERENCES Classes (course_nbr, section_nbr));

Since we were told that many students could sign up for a single course we know that the student
information was a repeating group. Note that the foreign key declaration allows us to enforce the
constraint that a student may only sign up for a valid course and section. Files do not do that; they are
raw data without any constraints that get meaning from the host program that reads them.

With the second table structure there are no repeating groups but there are other problems. For
example, what if a student changes his/her major? We need to changestudent_major for every
(course_nbr, section_nbr) that student_name is enrolled in.
We now need to introduce the concept of a functional dependency (FD) or determiner to borrow a
British term for the same idea. The notation is "X Y", and the arrow is read "X determines Y" in
English, where X and Y are sets of attributes that will be used to define columns in tables. A functional
dependency means that if I know a value of X, then I can determine a value of Y, just as you would in a
mathematical function.
A table is in Second Normal Form (2NF) if it has no "partial key" dependencies. That is, if X and Y are
columns or subsets of columns and X is a key then for any Z which is proper subset of X, it cannot be
the

case

that

Y.

In

the

example,

our

users

tell

us

that

knowing

the

(student_name, course_nbr) is sufficient to determine thesection_nbr (since students cannot


sign up for more than one section of the same course) and course_grade. This is the same as
saying that (student_name, course_nbr) (section_nbr, course_grade). After more
analysis we also discover from our users that student_name student_major; a student has
only one major, in English. This leads us to the following decomposition:
CREATE TABLE Classes
(course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
room_nbr INTEGER NOT NULL,
room_size INTEGER NOT NULL,
time_period INTEGER NOT NULL,
professor_name VARCHAR(25) NOT NULL,
PRIMARY KEY (course_nbr, section_nbr));

CREATE TABLE Enrollment


(student_name VARCHAR(25) NOT NULL,
course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
course_grade CHAR(1) NOT NULL)
PRIMARY KEY (student_name, course_nbr),

FOREIGN KEY (course_nbr, section_nbr)


REFERENCES Classes (course_nbr, section_nbr),
FOREIGN KEY (student_name)
REFERENCES Students (student_name));

CREATE TABLE Students


(student_name VARCHAR(25) NOT NULL PRIMARY KEY,
student_major);

At this point, we are in 2NF. Every attribute depends on the entire key. Now if a student changes majors
it can be done in one place. Furthermore, a student cannot sign up for different sections of the same
class because we changed the key of Enrollment. Unfortunately, we have still have problems.
Notice that while room_size depends on the entire key of Classes it also depends on room_nbr. If
the room_nbr is changed for a (course_nbr, section_nbr) we may also have to change
the room_size and if the room_nbr is modified (we knock down a wall) we may have to
change room_size in several rows in Classes for that room_nbr. Obviously, the room size is a
property of the room.
Another normal form can address these problems. A table is in Third Normal Form (3NF) if for all X
Y where X and Y are columns of a table then X is a key or Y is part of a "candidate" key (a key which
is not necessarily the primary key). This implies that the table in 2NF since a partial key dependency is
a type of transitive dependency. To get our example into 3NF we make the following decomposition:
CREATE TABLE Rooms
(room_nbr INTEGER NOT NULL PRIMARY KEY,
room_size INTEGER NOT NULL);

CREATE TABLE Classes


(course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
room_nbr INTEGER NOT NULL,
time_period INTEGER NOT NULL,

PRIMARY KEY (course_nbr, section_nbr),


FOREIGN KEY (room_nbr)
REFERENCES Rooms (room_nbr));

CREATE TABLE Enrollment


(student_name VARCHAR(25) NOT NULL,
course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
course_grade CHAR(1) NOT NULL,
PRIMARY KEY (student_name, course_nbr),
FOREIGN KEY (course_nbr, section_nbr)
REFERENCES Classes (course_nbr, section_nbr),
FOREIGN KEY (student_name)
REFERENCES Students (student_name));

CREATE TABLE Students


(student_name VARCHAR(25) NOT NULL PRIMARY KEY,
student_major VARCHAR(10) NOT NULL);

A common misunderstanding about relational theory is that 3NF has no transitive dependencies. As
indicated above, if X Y then X does not have to be a key. If Y is part of a candidate key. We still have
a transitive dependency in the example - (room_nbr, time_period) (course_nbr,
section_nbr) - but since the right side of the dependency is a key in is technically in 3NF. The
unreasonable behavior this table structure still has is that several courses can be assigned to the
same room_nbr at the sametime_period.

Boyce-Codd Normal Form (BCNF)


The normal form that actually removes all transitive dependencies is Boyce-Codd Normal Form or
BCNF. A table is in BCNF if for all X Y, X is a key; period. We can go to this normal form just by
adding another key with UNIQUE (room_nbr, time_period) constraint clause to the table

Classes. There are some other interesting and useful "higher" normal forms but that is out of the scope
of this discussion. In our example we have removed all of the important anomalies with BCNF.
A table can be in 3NF, but still have problems. Let's declare an abbreviated table for mailing labels.
CREATE TABLE Mailing_Labels
(street_address VARCHAR(35) NOT NULL,
city_name VARCHAR(25) NOT NULL,
zip_code CHAR(5) NOT NULL
CHECK (zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]'),
PRIMARY KEY (street_address, city_name)
);

The nature of US mailing addresses and the ZIP code gives us these two functional dependencies,
ignoring the fact that many cities have the same name within different states.
(street_address, city_name) zip_code
zip_code city_name

The first FD can give us a key, but we are not enforcing the second one. Consider this:
INSERT INTO Mailing_Labels
VALUES ('100 Main Street', 'Springfield', '99999'),
('100 Main Street', 'Yorktown', '99999');

In this example, we have to add another table to implement the second FD. Notice the overlapping
unique constraints.
CREATE TABLE Zip_Cities
(city_name VARCHAR(25) NOT NULL,
zip_code CHAR(5) NOT NULL UNIQUE
CHECK(zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]',
PRIMARY KEY (city_name, zip_code));

and modify the original table:

CREATE TABLE Mailing_Labels


(street_address VARCHAR(35) NOT NULL,
city_name VARCHAR(25) NOT NULL,
zip_code CHAR(5) NOT NULL
CHECK (zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]'),

PRIMARY KEY (street_address, city_name),


FOREIGN KEY (city_name, zip_code)
REFERENCES Zip_Cities (city_name, zip_code));

This is a bit elaborate, but it is all declarative code. Using overlapping constraints is a good technique
that does not get used as often as it should. Let's go to the next normal form. BCNF removes all the
transitive dependencies, not 3NF.

Fourth Normal Form


A table in BCNF can still have problems with multi-valued dependencies. Here is another overly simple
table.
CREATE TABLE Inventory
(item_nbr CHAR(13) NOT NULL,
dept_name CHAR(15) NOT NULL,
supplier_name CHAR(20) NOT NULL,
PRIMARY KEY (item_nbr, dept_name, supplier_name));

It is all key, so we have a BCNF table. Now let's enforce the following two multi-valued dependencies.
item_nbr dept_name
item_nbr supplier_name

The MVDs say that an item can be in one or more departments --- every department has trash cans.
The second multi-valued dependencies say that an item can have different suppliers -- trash can
come from many suppliers. But there is a problem with this table. Imagine that I get a new shiny red
trash can. I can not do anything with it until I find some department that wants a shiny red trash can. I
need to recognize that MVD constraints on the departments and suppliers are independent.

CREATE TABLE Department_Inventory


(item_nbr CHAR(13) NOT NULL,
dept_name CHAR(15) NOT NULL,
PRIMARY KEY (item_nbr, dept_name));

CREATE TABLE Suppliers_Inventory


(item_nbr CHAR(13) NOT NULL,
supplier_name CHAR(7) NOT NULL,
PRIMARY KEY (item_nbr, supplier_name));

Fifth Normal Form


You can have a n-ary relationship that cannot be decomposed into lower degree relations. If you ever
have had a mortgage or made any major purchase that involved a lender, then this Shelton will look
familiar.
CREATE TABLE Escrows
(buyer_name CHAR(15) NOT NULL,
seller_name CHAR(15) NOT NULL,
lender_name CHAR(15) NOT NULL,
PRIMARY KEY (buyer_name, seller_name, lender_name));

INSERT INTO Escrows(buyer_name, seller_name, lender_name)


VALUES ('Smith', 'Jones ', 'National Bank'),
('Smith ', 'Wilson', 'Home Bank'),
('Nelson', 'Jones ', 'Home Bank');

Let's try to split this into three binary relations.


(CREATE TABLE Buyer_Seller
(buyer_name CHAR(15) NOT NULL,

seller_name CHAR(15) NOT NULL


PRIMARY KEY (buyer_name, seller_name));

CREATE TABLE Buyer_Lender


(buyer_name CHAR(15) NOT NULL,
lender_name CHAR(15) NOT NULL,
PRIMARY KEY (buyer_name, lender_name));

CREATE TABLE Seller_Lender


(seller_name CHAR(15) NOT NULL,
lender_name CHAR(15) NOT NULL,
PRIMARY KEY (seller_name, lender_name));

The trouble is that when you try to assemble the original information by joining pairs of these three
tables together, thus:
SELECT BS.buyer_name, SL.seller_name, BL.lender_name
FROM BuyerLender AS BL, SellerLender AS SL, BuyerSeller AS BS
WHERE BL.buyer_name = BS.buyer_name
AND BL.lender_name = SL.lender_name
AND SL.seller_name = BS.seller_name;

This will add false rows, such as ('Smith', 'Jones', 'Home Bank'). These three tables should have been
implemented as VIEWs and not as base tables

Other Normal Forms and non-NF Redundancies


There are more normalizations, but they involve temporal consideration and other things. There are
also redundancies that involve more than one table. The most common non-normal form redundancy is
a summary column in a referenced table. In English, imagine the usual skeleton of an Orders table and
its referencing order details table:
CREATE TABLE Orders

(order_nbr CHAR(10) NOT NULL PRIMARY KEY,


..,
order_amt_tot DECIMAL (12,2) NOT NULL,
CHECK(order_amt_tot > 0.00),
..);

CREATE TABLE Order_Details


(order_nbr CHAR(10) NOT NULL
REFERENCES Orders(order_nbr)
ON UPDATE CASCADE
ON DELETE CASCADE,
sku CHAR(15) NOT NULL,
PRIMARY KEY (order_nbr, sku),
unit_price DECMAL (8,2) NOT NULL
CHECK (unit_price > 0.00),
order_qty INTEGER NOT NULL
CHECK (order_qty > 0),
..);

where order_amt_tot is the total amount of the order as computed by the query:
SELECT order_nbr, SUM(unit_price * order_qty) AS order_amt_tot
FROM Order_Details
GROUP BY order_nbr;

People who write an Orders table like this will be updating it constantly, often with procedural code in a
TRIGGER or stored procedure.

So, do we have to learn a lot of math just to get a correct schema? Well, yes. But you can use some
basic heuristics and feel that you have gotten it right 95% of the time

Thank this author by sharing:


4

This article has been viewed 37525 times.

Vous aimerez peut-être aussi