Prerequisites
Normal Forms: 1NF → BCNF, With Examples
Codd's design rules for tables — what each anomaly looks like, how the next normal form fixes it, and when to deliberately break the rules for performance.
TL;DR
Normal forms are a graduated set of rules for structuring relational tables to prevent redundancy and update anomalies. Codd defined the first three (1NF, 2NF, 3NF) in a 1971 follow-up to his foundational 1970 paper. Boyce-Codd Normal Form (BCNF) followed in 1974 after Ray Boyce pointed out a case 3NF missed. 4NF and 5NF exist but are rarely invoked. Most production schemas aim for 3NF or BCNF. Denormalizing below that — on purpose, for performance — is a deliberate engineering tradeoff, not a mistake. The point of knowing the rules is knowing which ones you’re breaking.
The Anomalies Normal Forms Prevent
Before walking through the normal forms, it’s worth seeing the problems they exist to solve. All of them come from redundancy — storing the same fact in multiple places, so that updates, insertions, or deletions can leave the database in an inconsistent state.
Consider a single-table design for customers and their orders:
┌────┬─────────────┬──────────┬─────────┬──────────┐
│ id │ customer │ city │ order_id│ amount │
├────┼─────────────┼──────────┼─────────┼──────────┤
│ 1 │ Ada │ London │ 1001 │ 250 │
│ 2 │ Ada │ London │ 1002 │ 430 │
│ 3 │ Turing │ London │ 1003 │ 80 │
└────┴─────────────┴──────────┴─────────┴──────────┘
Three anomalies:
- Update anomaly. If Ada moves to Paris, we have to update two rows. Miss one, and the database now says Ada lives in two cities at once.
- Insertion anomaly. We can’t record a customer who has no orders yet — there’s no row to put them in.
- Deletion anomaly. If we delete Turing’s only order (1003), we also delete the fact that Turing exists.
Normal forms are the formal rules for restructuring the data so these anomalies can’t happen.
First Normal Form (1NF)
Every column contains a single atomic value. No repeating groups, no nested structures.
Before 1NF — a table like this is not in 1NF:
┌────┬──────────┬──────────────────────────────┐
│ id │ customer │ phones │
├────┼──────────┼──────────────────────────────┤
│ 1 │ Ada │ "555-1234, 555-5678" │
│ 2 │ Turing │ "555-9999" │
└────┴──────────┴──────────────────────────────┘
The phones column stores a comma-separated list. This makes queries horrible (“find all customers with a given area code” requires substring matching), makes updates dangerous (inconsistent delimiters), and violates 1NF.
1NF fix: split the multi-valued attribute into its own table.
customers phones
┌────┬──────────┐ ┌────┬──────────┬──────────┐
│ id │ customer │ │ id │ cust_id │ number │
├────┼──────────┤ ├────┼──────────┼──────────┤
│ 1 │ Ada │ │ 1 │ 1 │ 555-1234 │
│ 2 │ Turing │ │ 2 │ 1 │ 555-5678 │
└────┴──────────┘ │ 3 │ 2 │ 555-9999 │
└────┴──────────┴──────────┘
The “atomic” requirement is interesting in modern SQL. Postgres lets you store arrays and JSON documents in a single column, which technically violates classical 1NF. In practice, it’s fine if you’re using those columns for data you never need to query piecewise. If you find yourself querying into an array, you should probably have normalized it out.
Second Normal Form (2NF)
1NF, plus: every non-key attribute depends on the whole primary key, not just part of it.
This rule only matters for tables with composite primary keys (primary keys made of multiple columns). Tables with single-column keys are trivially in 2NF.
Consider an order-items table where the primary key is (order_id, product_id):
┌──────────┬────────────┬───────────┬──────────────┬────────┐
│ order_id │ product_id │ product_name│ unit_price │ quantity│
├──────────┼────────────┼───────────┼──────────────┼────────┤
│ 1001 │ 42 │ Teapot │ 9.99 │ 2 │
│ 1001 │ 43 │ Mug │ 4.99 │ 4 │
│ 1002 │ 42 │ Teapot │ 9.99 │ 1 │
└──────────┴────────────┴───────────┴──────────────┴────────┘
Problem: product_name and unit_price depend only on product_id, not on the full (order_id, product_id) key. The teapot’s name and price get repeated every time it appears in an order.
2NF fix: pull product data out to its own table.
order_items products
┌──────────┬────────────┬────────┐ ┌─────┬───────────┬────────┐
│ order_id │ product_id │quantity│ │ id │ name │ price │
├──────────┼────────────┼────────┤ ├─────┼───────────┼────────┤
│ 1001 │ 42 │ 2 │ │ 42 │ Teapot │ 9.99 │
│ 1001 │ 43 │ 4 │ │ 43 │ Mug │ 4.99 │
│ 1002 │ 42 │ 1 │ └─────┴───────────┴────────┘
└──────────┴────────────┴────────┘
Now each product’s name and price live exactly once. Update the teapot’s price in one place; every order using it automatically reflects the change (well — assuming you don’t also record historical prices, which is a separate concern).
Third Normal Form (3NF)
2NF, plus: no non-key attribute depends on another non-key attribute (no transitive dependencies).
Consider an orders table:
┌────┬─────────┬──────────────┬──────────┐
│ id │ cust_id │ cust_name │ amount │
├────┼─────────┼──────────────┼──────────┤
│ 1001│ 101 │ Ada │ 250 │
│ 1002│ 101 │ Ada │ 430 │
│ 1003│ 102 │ Turing │ 80 │
└────┴─────────┴──────────────┴──────────┘
cust_name depends on cust_id, not directly on the primary key id. This is a transitive dependency: id → cust_id → cust_name. If Ada changes her name, we’d have to update every order row.
3NF fix: move customer name to a customers table, referenced by the order’s cust_id:
orders customers
┌────┬─────────┬────────┐ ┌─────┬──────────┐
│ id │ cust_id │ amount │ │ id │ name │
├────┼─────────┼────────┤ ├─────┼──────────┤
│1001│ 101 │ 250 │ │ 101 │ Ada │
│1002│ 101 │ 430 │ │ 102 │ Turing │
│1003│ 102 │ 80 │ └─────┴──────────┘
└────┴─────────┴────────┘
Three-table structure: customers, orders, join relationship via cust_id. This is where most production schemas settle. 3NF is the target for the majority of OLTP systems, and once you’ve achieved it, you’ve avoided the large majority of update anomalies.
Boyce-Codd Normal Form (BCNF)
3NF, plus: every functional dependency has a superkey as its left-hand side.
BCNF is a stricter version of 3NF that covers a case 3NF technically allows. Imagine a table tracking which professors teach which courses, with the constraint that each professor teaches only one course but multiple professors can teach the same course:
┌───────────┬────────────┬───────────────┐
│ student │ course │ professor │
├───────────┼────────────┼───────────────┤
│ Alice │ 6.001 │ Sussman │
│ Bob │ 6.001 │ Sussman │
│ Alice │ 6.004 │ Knight │
│ Carol │ 6.001 │ Abelson │
└───────────┴────────────┴───────────────┘
Primary key: (student, course). Since professor depends on the whole key (which professor depends on who the student is and which course), this is in 3NF.
But there’s a hidden dependency: professor → course. Sussman teaches 6.001. That’s a fact about Sussman, not about any particular student-course pair. If we want to record that Abelson will start teaching 6.001 next term, we have no place to put that fact without a student also being involved.
BCNF fix: split into two tables.
enrollments teaches
┌───────────┬────────────┐ ┌───────────┬────────────┐
│ student │ professor │ │ professor │ course │
├───────────┼────────────┤ ├───────────┼────────────┤
│ Alice │ Sussman │ │ Sussman │ 6.001 │
│ Bob │ Sussman │ │ Knight │ 6.004 │
│ Alice │ Knight │ │ Abelson │ 6.001 │
│ Carol │ Abelson │ └───────────┴────────────┘
└───────────┴────────────┘
Each table now has a functional dependency whose left side is a superkey.
BCNF was introduced by Codd and Boyce in 1974 as a refinement that catches cases 3NF missed. In practice, most 3NF schemas are also in BCNF — the cases where they differ are rare and often involve tables modeling business rules with multiple independent keys.
4NF: Multivalued Dependencies
BCNF, plus: no non-trivial multivalued dependencies.
Consider a table of employees with multiple skills and multiple languages, where these two attributes are completely independent:
┌────────────┬─────────┬──────────┐
│ employee │ skill │ language │
├────────────┼─────────┼──────────┤
│ Dylan │ Python │ English │
│ Dylan │ Python │ Spanish │
│ Dylan │ SQL │ English │
│ Dylan │ SQL │ Spanish │
└────────────┴─────────┴──────────┘
The Cartesian product of skills and languages gets stored. If Dylan learns a new language, we add two new rows (one for each existing skill). This is a multivalued dependency — employee →→ skill and employee →→ language — and 4NF says: split them.
employee_skills employee_languages
┌────────────┬─────────┐ ┌────────────┬──────────┐
│ employee │ skill │ │ employee │ language │
├────────────┼─────────┤ ├────────────┼──────────┤
│ Dylan │ Python │ │ Dylan │ English │
│ Dylan │ SQL │ │ Dylan │ Spanish │
└────────────┴─────────┘ └────────────┴──────────┘
4NF rarely gets invoked by name in production schema design, because the situation it prevents is relatively rare and most designers instinctively split independent attributes without thinking “4NF.”
5NF and 6NF
5NF (Project-Join Normal Form) handles cases where a relation can be decomposed into three or more smaller relations but not into two. It’s almost never invoked — the cases are arcane, and most practitioners don’t explicitly design for it.
6NF is a research-world form used mainly in temporal databases, where every fact is stored with a timestamp range. Outside of specialized contexts, 6NF is more of an academic object.
If you’re asking whether your schema is in 5NF, you’re working on a research paper, not a production database.
Deliberate Denormalization
Normalization is a default, not a law. There are legitimate reasons to deliberately break the rules.
Read-heavy workloads with infrequent updates benefit from denormalization. A data warehouse that stores transactions might denormalize customer location, product category, and currency onto the fact table, producing a star schema where queries can answer most questions without joins. Snowflake, BigQuery, Redshift — all have schemas that look absolutely nothing like 3NF by design.
Write-heavy, low-scale systems benefit from normalization. An OLTP database serving ~1000 writes/second with complex business rules is much easier to reason about in 3NF.
Caching and materialized views are a form of denormalization. You’re not changing the source-of-truth schema; you’re creating derived, denormalized copies for performance. This can be the best of both worlds if you can tolerate staleness.
A few rules for denormalizing responsibly:
- Keep the normalized source of truth. Denormalization is a cache, not the primary data.
- Make the denormalization explicit. Don’t copy a field “just in case”; copy it because there’s a specific query that needs it fast.
- Document the update paths. When the source data changes, something has to update the denormalized copy. If you don’t know what, the copy will drift.
- Benchmark before denormalizing. Most performance problems have other fixes (indexes, better queries, caching layers) that don’t require violating normalization.
Why the Forms Come in an Order
A notable property of the normal forms: they’re nested. BCNF implies 3NF, 3NF implies 2NF, 2NF implies 1NF. If you achieve BCNF, you’ve also satisfied every previous form.
This isn’t a coincidence. Each form defines a new kind of redundancy that the previous forms allowed. You can’t talk about transitive dependencies (3NF) if you haven’t first established that there’s a well-defined set of attributes to reason about (1NF). You can’t have a partial dependency (2NF) unless you have a composite key, which the existence of the primary key (1NF) establishes.
The forms were introduced chronologically because they address progressively subtler redundancy. 1970s databases struggled with 1NF (people were still using variable-length columns with embedded delimiters). By the late 1980s, 3NF was the default design pattern, and by the 1990s, people had mostly stopped re-deriving BCNF because standard design practice produced it automatically.
What the Normal Forms Actually Give You
The Relational Model post summarizes normalization as “rules for structuring tables to minimize redundancy and prevent inconsistencies.” That’s accurate as far as it goes. The deeper thing the normal forms provide is a design vocabulary — a set of concepts you can use to describe why a schema feels wrong.
“That’s an update anomaly” is a precise criticism, not a vague one. “This isn’t 2NF” is a specific, fixable problem. Having the language lets you diagnose schemas quickly and have structured conversations about why something is painful.
In production: aim for 3NF or BCNF as the default. Denormalize deliberately, with evidence, for specific workloads. Use the normal forms as a tool for thinking about data, not as a compliance checklist to pass.
Codd’s 1971 paper introducing the normal forms was largely a cleanup of his 1970 proposal. Fifty years later, it’s still the working vocabulary of schema design. That’s a rare property for a piece of 1970s computer science. It held up because it was less about databases and more about how data refuses to behave when you represent it redundantly — a truth that stays true regardless of what storage engine, query language, or industry you’re working in.