Prerequisites
Codd's 1970 Paper, Annotated
Walking through 'A Relational Model of Data for Large Shared Data Banks' — what 'relational' actually means mathematically, which ideas survived into SQL, and which quietly got dropped.
Table of Contents
TL;DR
Edgar Codd’s 1970 paper in Communications of the ACM redefined how computers store data. It’s also an unusually dense piece of writing — a twelve-page document that quietly drops mathematical definitions of relations, foreign keys, normal forms, and a full query language in the space of one tradeshow-sized coffee break. This post walks through what the paper actually said, what the precise definitions mean, what survived into SQL, what didn’t, and why Codd felt the need to fight for another decade after publication to get his own ideas implemented the way he intended.
The Setup
In 1970, Edgar Codd was a British-born mathematician working at IBM’s San Jose Research Lab. He’d been at IBM for almost two decades — long enough to have seen the first generation of hierarchical databases (IMS) and network databases (CODASYL) in production. He thought both were designed wrong. Not buggy — conceptually wrong.
The paper he published in June 1970, “A Relational Model of Data for Large Shared Data Banks,” is the argument for a different approach. It’s twelve pages, dense, mathematically precise, and clearly aimed at computer scientists rather than DBAs. IBM’s own database group largely ignored it for years.
What “Relational” Means
The paper opens with a definition that’s easy to gloss over:
Given sets S₁, S₂, …, Sₙ (not necessarily distinct),
Ris a relation on thesensets if it is a set ofn-tuples each of which has its first element from S₁, its second element from S₂, and so on.
Translated: a relation is a set of tuples, where each tuple has the same structure. Each “set” Sᵢ is called a domain — the set of all possible values a given position in the tuple can take. Example:
Customers is a relation on { CustomerID, Name, City }:
(101, "Ada Lovelace", "London")
(102, "Alan Turing", "London")
(103, "Grace Hopper", "New York")
Each tuple is an element of CustomerID × Name × City — a Cartesian product of three sets. The relation is a subset of that Cartesian product. Not every possible combination is in the relation; only the ones that actually represent valid data.
The paper is careful to distinguish relation (the mathematical object — a set of tuples) from relational schema (the description of the structure — the names and types of columns). In SQL, a table is roughly the stored representation of a relation, but with important differences.
The Distinction Codd Cared About: Duplicates and Order
Two details from the paper matter enormously and didn’t fully survive into SQL:
- A relation is a set. No duplicate tuples. Two rows with identical values collapse into one.
- A relation is unordered. There’s no “first” row. Queries that rely on physical ordering are cheating — you can only sort if you explicitly say how.
SQL relaxed both of these. A SQL table is technically a bag (multiset), not a set — duplicates are allowed. SQL has SELECT DISTINCT to restore the set property, which means the default is non-relational. Codd objected to this for the rest of his career. His later “Rule 1” (from Codd’s 12 Rules, published in the 1980s) explicitly required tables to be relations, not bags.
The ordering issue is subtler. SQL doesn’t guarantee row order without ORDER BY, but in practice implementations often return rows in insertion or storage order, and application code grows to depend on that. Codd’s paper was explicit: if your query needs sorted output, the query should specify the sort. Relying on the implementation is a category error.
Data Independence: The Goal of the Paper
The paper’s motivation — and its most consequential argument — comes early:
The prevailing model of present-day formatted data banks is the tree (or hierarchical) model… Such a technology forces programs to be aware of the particular ordering of the stored representation.
Codd’s claim was that hierarchical and network databases had a design flaw: application code had to know how the data was physically stored. If the storage changed, the code broke. Changing an index, partitioning a table, migrating to different hardware — any of these could require rewriting hundreds of programs.
His proposed fix was data independence: a clean separation between what data means and how it’s stored. Queries would describe the desired result; the database engine would figure out the physical access path. Application code would never need to know about indexes, storage layouts, or access methods.
This was a radical claim in 1970. It required two things that didn’t yet exist in any production database:
- A declarative query language that could describe data in terms of logical structure only.
- A query optimizer that could translate logical queries into physical access plans efficiently.
Codd’s paper sketches the first. The second became the core engineering challenge of System R (IBM’s 1974-79 research project) and every database that followed.
Relational Algebra
The paper introduces a set of operations on relations that together form relational algebra. These are the primitives a query language would eventually expose. Codd’s original list:
- σ (Selection) — pick the tuples that satisfy a condition.
σ city='London' (Customers)returns just the London customers. - π (Projection) — pick specific columns.
π name, city (Customers)returns a relation of just(name, city)pairs. - × (Cartesian product) — pair every tuple in one relation with every tuple in another.
- ⋈ (Join) — the operation Codd’s paper called the “natural join”: combine tuples from two relations that share attribute values.
- ∪, ∩, − (Union, intersection, difference) — standard set operations.
- ÷ (Division) — for “all X such that for every Y…” queries. This one rarely survives into SQL without gymnastics.
The paper proves that with just these primitives, you can express any first-order logic query over the data. That’s the mathematical basis for what we now call “relational completeness” — a query language that expresses everything relational algebra can.
SQL, when it shipped, was designed to be relationally complete. Its syntax is different from Codd’s notation — SELECT name FROM customers WHERE city = 'London' rather than π name (σ city='London' (Customers)) — but the expressive power is supposed to be equivalent.
Supposed to be. In practice, SQL’s deviations from pure relational algebra (bags, NULLs, implicit type coercion) mean some queries are expressible in one but not the other. Codd complained about this. Most practitioners don’t care.
Foreign Keys and Referential Integrity
The paper introduces the idea of foreign keys — attributes in one relation whose values must match primary-key values in another. This is the mechanism by which relationships between entities are represented without physical pointers.
The crucial property is referential integrity: if an orders tuple has customer_id = 101, there must actually be a customers tuple with id = 101. Deleting the customer should either cascade (delete the orders) or be prevented (the delete fails).
Hierarchical databases had referential integrity by construction — a child record simply couldn’t exist without its parent. Relational databases have to enforce it explicitly. Codd’s 1970 paper introduces the concept but leaves enforcement mostly unaddressed. That work wouldn’t land in SQL until FOREIGN KEY constraints were added in the 1989 SQL-89 standard.
Normal Forms (Implicit)
The 1970 paper doesn’t use the term “normal form” — that came in Codd’s 1971-72 follow-up papers. But the 1970 paper sets up the problem normal forms solve. It defines keys (attributes that uniquely identify tuples) and functional dependencies (attribute A determines attribute B if knowing A tells you B). It notes the danger of “anomalies” when attributes are redundantly stored.
By 1972, Codd had formalized First, Second, and Third Normal Form, and by 1974 he and Boyce had defined Boyce-Codd Normal Form to handle cases 3NF missed. The normal forms post walks through these in detail. The 1970 paper is where the seeds are planted.
What the Paper Didn’t Have
A few things that feel integral to relational databases today are not in the 1970 paper:
- SQL. The paper uses relational algebra notation throughout. SQL — then called SEQUEL — would be designed by Chamberlin and Boyce at IBM starting in 1974, explicitly to make Codd’s math usable by programmers.
- Indexes. The paper doesn’t describe index structures or physical storage. That was the point — the paper argued such details should be invisible to users. The actual indexing work (B-trees, hash indexes) was done separately.
- Transactions and ACID. Concurrency and failure recovery were handled in follow-up work, particularly the formalization of ACID by Härder and Reuter in 1983.
- SQL’s
NULL. Codd was, famously, ambivalent about NULL. He argued for a three-valued logic (true, false, unknown) but also thought NULL was overused in practice. Later he proposed multiple kinds of NULL (value unknown vs. value inapplicable), which SQL implementations did not adopt and which he didn’t entirely abandon.
What Didn’t Survive
A few parts of the paper got left behind:
- The “natural join” as default. Codd’s paper treated joins primarily as natural joins — automatically joining on attributes with the same name. SQL made this opt-in (via
NATURAL JOIN), and most practitioners use explicitJOIN ... ONclauses instead, because relying on naming conventions across tables is fragile. - Division. The
÷operator (relational division) is powerful but awkward. SQL doesn’t have a direct equivalent; expressing “find suppliers who supply all parts” requires nestedNOT EXISTSsubqueries or similar workarounds. - Pure set semantics. As discussed, SQL uses bag semantics.
- Codd’s specific approach to NULLs. His proposals for multi-valued NULL handling never made it into SQL.
- The “Update” operations the paper proposed. The paper includes some awkward discussions of how to handle deletion in presence of foreign keys that later got completely reworked.
Codd’s Own Frustrations
The Relational Model post mentions that IBM “sat on” the relational model for years. Codd himself was frustrated throughout the 1970s and 1980s that IBM shipped products (SQL/DS, then DB2) that he considered incompletely relational.
In 1985, he published “Codd’s 12 Rules” (actually 13, numbered 0 through 12) as an explicit checklist for what a relational database should do. The rules were deliberately strict — no commercial database fully met all of them. IBM’s DB2, Oracle, and every other vendor missed on various counts.
Codd’s 12 Rules were partly a marketing gambit against vendors claiming “relational” status while shipping products that bolted relational queries onto fundamentally non-relational internals. He spent much of the late 1980s going around giving talks where he scored commercial products against his rules. Vendors hated it.
The rules themselves were a useful purity test. But they also revealed the gap between theoretical elegance and practical engineering — building a database that strictly satisfies Codd’s rules isn’t necessarily the best database, just the most mathematically consistent. The industry mostly settled on “relational-ish” systems that made pragmatic tradeoffs Codd wouldn’t have approved of.
Why the Paper Held Up
Most 1970-era papers in computer science are of historical interest. Codd’s is still load-bearing infrastructure. Every SQL query you’ve ever written runs through a query optimizer that’s applying concepts from that paper. Foreign keys, joins, data independence — all of it traces directly to those twelve pages.
The reason it held up:
- It wasn’t implementation-specific. Codd described a mathematical model, not a data structure. That meant the ideas survived generational changes in hardware, storage, and software architecture.
- It made the right separation of concerns. Logical vs. physical, declarative vs. procedural. Every subsequent database system adopted that separation, because the alternative was rebuilding the same coupling mistakes CODASYL had made.
- It was provably complete. Relational algebra was known to express exactly first-order logic on structured data. That meant the query language wasn’t going to run out of expressive power in some unexpected way — which early hierarchical query languages sometimes did.
The Relational Model post summarizes the paper as “deceptively simple.” The “deceptive” is carrying a lot of weight. Under the surface is set theory, first-order logic, functional dependency theory, and a precise argument about what a database is — reasoning careful enough that fifty-six years of practical engineering has mostly been filling in details, not revisiting the plan.