Database Normalization
Database normalization is a foundational concept in relational database design that helps structure data to minimise redundancy and ensure integrity. We often implement some of these design patterns, but I've found the "Normal Forms" helpful for verbalising and discussing these patterns.
What is Database Normalization?
Normalization is the process of organizing database tables so they store data efficiently and consistently by eliminating unnecessary duplicates and ensuring logical dependencies. Proper normalization helps prevent issues like update, insertion, and deletion anomalies.
The Normal Forms Explained
1. First Normal Form (1NF)
Ensures values in each column are atomic (no lists or repeated groups).
Each table needs a primary key and all columns contain atomic (single) values.
1NF comes with some core criteria (that we typically take for granted as many databases handle this automatically for us) such as:
- There must be a primary key.
- Rows cannot contain duplicate primary keys.
- Columns cannot contain mixed data types.
- No repeating rows (data existing twice doesn't provide information).
- The order of the rows does not hold any information (e.g. row 1 is the oldest datapoint and the last row is the most recent datapoint).
Example:
| StudentID (pk) |
Name |
Courses |
| 1 |
Alice |
Math, History |
| 2 |
Bob |
Math |
The Courses column contains non-atomic data (a repeating group).
The solution here would be to have a table dedicated to student information and a table to capture course enrollment information.
2. Second Normal Form (2NF)
All attributes must depend on the entire primary key (not just part of it). This prevents partial dependency issues.
| StudentID (pk) |
CourseID (pk) |
StudentName |
CourseName |
EnrollmentDate |
| 1 |
101 |
Alice |
Math |
2026-01-02 |
| 1 |
102 |
Alice |
History |
2026-01-05 |
| 2 |
101 |
Bob |
Math |
2026-01-03 |
Here the composite key is (StudentID, CourseID), but StudentName depends only on StudentID and CourseName depends only on CourseID. This is a partial dependency, violating 2NF.
The better solution would be to have dedicated Student, Course and Enrollment tables.
Note that 2NF only applies to tables with composite primary keys. Be wary of tables that have a surrogate key though as this may hide the underlying relationships and uniqueness requirements.
3. Third Normal Form (3NF)
Every non-key attribute must depend only on the primary key.
Eliminates transitive dependencies (where one non-key field depends on another).
| StudentID (pk) |
AdvisorID |
AdvisorOffice |
| 1 |
A101 |
Rm 12 |
| 2 |
A102 |
Rm 15 |
AdvisorOffice depends on AdvisorID, which is not a key. AdvisorID itself depends on StudentID. This is a transitive dependency.
The better solution here is to have an Advisors table and an AdvisorAssignment table.
4. Fourth Normal Form (4NF)
Eliminates multivalued dependencies. These are situations where one key value relates to multiple independent sets of values.
| Model (pk) |
Color (pk) |
Supplier (pk) |
| 2000 |
Green |
SupplierA |
| 2000 |
Green |
SupplierB |
| 2000 |
Yellow |
SupplierA |
| 2000 |
Yellow |
SupplierB |
| 2010 |
Green |
SupplierA |
| 2010 |
Green |
SupplierB |
| 2010 |
Yellow |
SupplierA |
| 2010 |
Yellow |
SupplierB |
| 2010 |
Blue |
SupplierA |
| 2010 |
Blue |
SupplierB |
| 2025 |
Orange |
SupplierA |
| 2025 |
Orange |
SupplierB |
| 2025 |
Magenta |
SupplierA |
| 2025 |
Magenta |
SupplierB |
| 2025 |
Gold |
SupplierA |
| 2025 |
Gold |
SupplierB |
If the Color and Supplier information are independent of each other then they shouldn't be in the same table. In this example for Color and Supplier to be independent:
- Each model has a specific range of colors.
- Each model can be produced by a range of suppliers.
- Every supplier must be capable of creating each color.
The better solution here would be to have a ModelColor table and a ModelSupplier table.
Note that if the information is "dependent" we no longer violate 4NF.
| Model (pk) |
Color (pk) |
Supplier (pk) |
| 2000 |
Green |
SupplierA |
| 2000 |
Yellow |
SupplierB |
The model is available in Green from SupplierA or Yellow from SupplierB. We cannot split this data into separate tables as the attributes are dependent on one another.
5. Fifth Normal Form (5NF)
Ensures a table cannot be reconstructed from smaller join results meaning the structure is fully decomposed.
Aims for maximum flexibility and minimal redundancy.
| Supplier (pk) |
Part (pk) |
Project (pk) |
| Smith |
Bolt |
Bridge |
| Smith |
Nut |
Dam |
| Jones |
Bolt |
Bridge |
| Jones |
Nut |
Dam |
| Jones |
Cam |
Tower |
Here the fields are dependent. The part is made by a specific supplier for a certain project. This means it does not violate 4NF. It does violate 5NF though because we can recreate the data using smaller joins.
| Supplier (pk) |
Part (pk) |
| Smith |
Bolt |
| Smith |
Nut |
| Jones |
Bolt |
| Jones |
Nut |
| Jones |
Cam |
| Supplier (pk) |
Project (pk) |
| Smith |
Bridge |
| Smith |
Dam |
| Jones |
Bridge |
| Jones |
Dam |
| Jones |
Tower |
| Part (pk) |
Project (pk) |
| Bolt |
Bridge |
| Nut |
Dam |
| Cam |
Tower |
The "proper" solution here is to use these normalized tables.
Normalizing this information isn't always beneficial (as you can see from this example).
Why It Matters
- Cleaner data: Fewer duplicates, stronger data integrity.
- Easier maintenance: Changes in one place propagate correctly.
- Better performance: We rely on the power of joins to retrieve the appropriate data.
- Scalability: Structured design handles growth gracefully.
Normalization is essential for robust database design for any relational database.