keyboard-shortcut
d

database normalization

5min read

an image

Database Normalization

Database normalization is a foundational concept in relational database design that helps structure data to minimise redundancy and ensure integrity. We often implement some of these design patterns, but I've found the "Normal Forms" helpful for verbalising and discussing these patterns.

What is Database Normalization?

Normalization is the process of organizing database tables so they store data efficiently and consistently by eliminating unnecessary duplicates and ensuring logical dependencies. Proper normalization helps prevent issues like update, insertion, and deletion anomalies.

The Normal Forms Explained

1. First Normal Form (1NF)

Ensures values in each column are atomic (no lists or repeated groups).

Each table needs a primary key and all columns contain atomic (single) values.

1NF comes with some core criteria (that we typically take for granted as many databases handle this automatically for us) such as:

  • There must be a primary key.
  • Rows cannot contain duplicate primary keys.
  • Columns cannot contain mixed data types.
  • No repeating rows (data existing twice doesn't provide information).
  • The order of the rows does not hold any information (e.g. row 1 is the oldest datapoint and the last row is the most recent datapoint).

Example:

StudentID (pk) Name Courses
1 Alice Math, History
2 Bob Math

The Courses column contains non-atomic data (a repeating group).

The solution here would be to have a table dedicated to student information and a table to capture course enrollment information.

2. Second Normal Form (2NF)

All attributes must depend on the entire primary key (not just part of it). This prevents partial dependency issues.

StudentID (pk) CourseID (pk) StudentName CourseName EnrollmentDate
1 101 Alice Math 2026-01-02
1 102 Alice History 2026-01-05
2 101 Bob Math 2026-01-03

Here the composite key is (StudentID, CourseID), but StudentName depends only on StudentID and CourseName depends only on CourseID. This is a partial dependency, violating 2NF.

The better solution would be to have dedicated Student, Course and Enrollment tables.

Note that 2NF only applies to tables with composite primary keys. Be wary of tables that have a surrogate key though as this may hide the underlying relationships and uniqueness requirements.

3. Third Normal Form (3NF)

Every non-key attribute must depend only on the primary key.

Eliminates transitive dependencies (where one non-key field depends on another).

StudentID (pk) AdvisorID AdvisorOffice
1 A101 Rm 12
2 A102 Rm 15

AdvisorOffice depends on AdvisorID, which is not a key. AdvisorID itself depends on StudentID. This is a transitive dependency.

The better solution here is to have an Advisors table and an AdvisorAssignment table.

4. Fourth Normal Form (4NF)

Eliminates multivalued dependencies. These are situations where one key value relates to multiple independent sets of values.

Model (pk) Color (pk) Supplier (pk)
2000 Green SupplierA
2000 Green SupplierB
2000 Yellow SupplierA
2000 Yellow SupplierB
2010 Green SupplierA
2010 Green SupplierB
2010 Yellow SupplierA
2010 Yellow SupplierB
2010 Blue SupplierA
2010 Blue SupplierB
2025 Orange SupplierA
2025 Orange SupplierB
2025 Magenta SupplierA
2025 Magenta SupplierB
2025 Gold SupplierA
2025 Gold SupplierB

If the Color and Supplier information are independent of each other then they shouldn't be in the same table. In this example for Color and Supplier to be independent:

  • Each model has a specific range of colors.
  • Each model can be produced by a range of suppliers.
  • Every supplier must be capable of creating each color.

The better solution here would be to have a ModelColor table and a ModelSupplier table.

Note that if the information is "dependent" we no longer violate 4NF.

Model (pk) Color (pk) Supplier (pk)
2000 Green SupplierA
2000 Yellow SupplierB

The model is available in Green from SupplierA or Yellow from SupplierB. We cannot split this data into separate tables as the attributes are dependent on one another.

5. Fifth Normal Form (5NF)

Ensures a table cannot be reconstructed from smaller join results meaning the structure is fully decomposed.

Aims for maximum flexibility and minimal redundancy.

Supplier (pk) Part (pk) Project (pk)
Smith Bolt Bridge
Smith Nut Dam
Jones Bolt Bridge
Jones Nut Dam
Jones Cam Tower

Here the fields are dependent. The part is made by a specific supplier for a certain project. This means it does not violate 4NF. It does violate 5NF though because we can recreate the data using smaller joins.

Supplier (pk) Part (pk)
Smith Bolt
Smith Nut
Jones Bolt
Jones Nut
Jones Cam
Supplier (pk) Project (pk)
Smith Bridge
Smith Dam
Jones Bridge
Jones Dam
Jones Tower
Part (pk) Project (pk)
Bolt Bridge
Nut Dam
Cam Tower

The "proper" solution here is to use these normalized tables.

Normalizing this information isn't always beneficial (as you can see from this example).

Why It Matters

  • Cleaner data: Fewer duplicates, stronger data integrity.
  • Easier maintenance: Changes in one place propagate correctly.
  • Better performance: We rely on the power of joins to retrieve the appropriate data.
  • Scalability: Structured design handles growth gracefully.

Normalization is essential for robust database design for any relational database.