Prerequisites
The JOIN: Relationships Through Values, Not Pointers
Expressing relationships through shared values instead of physical references was the real breakthrough of the relational model. Fifty years later, the four join types, three join algorithms, and one fundamental question still run most of the world's data.
TL;DR
In pre-relational databases, relationships between records were physical pointers — you navigated from a customer record to an order record by following a literal memory address the database writer had decided to store. In Codd’s relational model, relationships are logical: a shared value in two tables (the customer’s ID appears in both the customers table and the orders table), and the JOIN operation combines rows from the two tables on demand. This sounds like a minor change and is instead the difference between a database you can evolve and one you can’t. Joins come in four types (INNER, LEFT, RIGHT, FULL OUTER), execute with three main algorithms (nested loop, hash, sort-merge), and are the source of most performance problems in real database workloads.
Before the JOIN: Pointer Navigation
In CODASYL network databases and IMS hierarchical databases, relationships were explicit. A customer record had a physical pointer to the first of its order records. Each order record had a pointer to the next one. To find all of a customer’s orders, you wrote code that looked roughly like this:
customer = find_customer(id=101)
order = customer.first_order
while order is not None:
process(order)
order = order.next_order
This is efficient if the only question you ever want to ask is “give me customer 101’s orders.” It’s also the only question you can ask cheaply. To go from an order back to the customer, you need a back-pointer — and the back-pointer has to have been set up when the data was inserted. To find “all orders to London addresses” you have to walk every customer’s order list and filter.
Every new query pattern required new pointers. Every new pointer required schema changes, data migrations, and revised application code. Queries were fast when anticipated and impossible when not.
The Relational Alternative
Codd’s 1970 paper proposed the opposite: no pointers at all. A customer record has an ID. An order record has a customer_id column. The relationship is the fact that the two columns contain matching values. The data doesn’t physically know this — it’s a logical property of the data, one the database can discover and exploit at query time.
Customers Orders
┌────┬──────────────┬──────────┐ ┌────┬─────────┬──────┐
│ id │ name │ city │ │ id │ cust_id │ amt │
├────┼──────────────┼──────────┤ ├────┼─────────┼──────┤
│101 │ Ada Lovelace │ London │ │ 1 │ 101 │ 250 │
│102 │ Alan Turing │ London │ │ 2 │ 102 │ 80 │
│103 │ Grace Hopper │ New York │ │ 3 │ 101 │ 430 │
└────┴──────────────┴──────────┘ └────┴─────────┴──────┘
Implicit relationship: orders.cust_id matches customers.id
No pointer in the orders table says “this order belongs to this physical customer record.” The column cust_id just says “the customer for this order has whatever ID is 101.” Joining the tables on orders.cust_id = customers.id produces the combined view.
The payoff is flexibility. The same data can be joined in different directions for different queries:
- “Find all of Ada’s orders” — join from customers to orders.
- “Find the customer who placed order 3” — join from orders to customers.
- “Find all customers who ordered over $400” — join and filter.
- “Find the average order amount by city” — join and aggregate.
The database didn’t have to be designed with any of these queries in mind. The relationships are inherent in the data, and the query language discovers them.
The Four Join Types
SQL exposes four main join types, distinguished by what happens to rows that don’t have a match.
INNER JOIN
The default. Only rows that have matches in both tables appear in the result.
SELECT c.name, o.amount
FROM customers c
INNER JOIN orders o ON o.cust_id = c.id;
If a customer has no orders, they don’t appear. If an order has no matching customer (which would indicate a referential integrity violation), it doesn’t appear.
LEFT OUTER JOIN
All rows from the left table, plus matches from the right. Unmatched left rows get NULL on the right side.
SELECT c.name, o.amount
FROM customers c
LEFT JOIN orders o ON o.cust_id = c.id;
Grace Hopper has no orders. She still appears in the result with NULL in the amount column. This is the join you use when the question is “list all customers, and if they have orders, show those.”
RIGHT OUTER JOIN
Mirror image of LEFT. All rows from the right table, unmatched right rows get NULL on the left. Rarely used in practice — you can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the tables, and most query authors prefer to read left-to-right.
FULL OUTER JOIN
All rows from both tables. Unmatched rows get NULLs on the side they didn’t match. Useful for finding data that should be in both tables but isn’t — reconciliation reports, for instance.
The fifth join nobody asked for: CROSS JOIN
CROSS JOIN is the Cartesian product — every row of the left table paired with every row of the right, no match condition. If you have 1,000 customers and 10,000 orders, a cross join returns 10 million rows. Usually a mistake. Occasionally intentional — e.g., generating all possible combinations of values for testing.
-- DANGER: rarely what you want
SELECT c.name, o.amount
FROM customers c
CROSS JOIN orders o;
A missing ON clause in SQL often produces an accidental cross join, which is why modern SQL dialects sometimes refuse to run a JOIN without an explicit join condition.
The Three Join Algorithms
Writing JOIN in SQL is declarative — it says what to join, not how. The query optimizer picks the physical algorithm based on table sizes, available indexes, memory, and statistics. There are three main choices.
Nested Loop Join
for each row r in outer_table:
for each row s in inner_table:
if match(r, s):
emit(r, s)
Complexity: O(N × M) for tables of size N and M. Naive and slow for large tables — but if there’s an index on the inner table’s join column, you can replace the inner loop with an index lookup, dropping the cost to O(N × log M).
Nested loop wins when:
- One table is tiny (a few rows).
- There’s a good index on the inner table’s join column.
- Memory is tight — nested loop doesn’t need to buffer.
Hash Join
build: hash_table = {}
for each row r in smaller_table:
hash_table[r.join_key].append(r)
probe: for each row s in larger_table:
for r in hash_table.get(s.join_key, []):
emit(r, s)
Complexity: O(N + M) with roughly O(N) memory. Much faster than nested loop for large tables, but requires enough memory to hold the smaller table’s hash. Variant: grace hash join partitions both tables to disk first, then hash-joins partition by partition — still reasonable when the data doesn’t fit in RAM.
Hash join wins when:
- Both tables are large.
- Neither has a useful index on the join column.
- The smaller table fits in memory.
Sort-Merge Join
sort both tables by the join key
walk both sorted streams in parallel, emitting matches
Complexity: O(N log N + M log M) due to sorting, but if the data is already sorted (e.g., reading from a B-tree index on the join key), it drops to O(N + M).
Sort-merge join wins when:
- Both tables are already sorted by the join key.
- The data is too big for hash join’s memory footprint.
- The output needs to be sorted anyway.
Modern databases also have variants: merge join with non-equality, broadcast join (for distributed systems, where one side is small enough to copy to every node), bloom-filter-assisted joins, and so on. The three above are the core cases.
Why Join Ordering Matters
A query that joins four tables can be executed in many different orders:
SELECT *
FROM customers c
JOIN orders o ON o.cust_id = c.id
JOIN order_items i ON i.order_id = o.id
JOIN products p ON p.id = i.product_id;
There are 4! / 2 = 12 ways to order these joins (accounting for left-deep trees). Each order produces the same result but potentially with vastly different performance:
- Join
ordersandorder_itemsfirst (both large tables) — you get a huge intermediate result. - Filter
customersto London first, then join to orders — much smaller intermediate result. - Join
productsearly if it’s small and highly selective — very small intermediate.
Choosing among these is the job of the query optimizer, covered in its own post. The choice depends on table sizes, available indexes, and cardinality estimates from statistics. Get it wrong and your query runs in hours instead of milliseconds.
Most “this query is slow” problems in real databases are join-ordering problems in disguise. EXPLAIN ANALYZE usually reveals that the optimizer picked a bad order because its statistics were stale or because the query structure confused the estimator.
Why Not Just Use a Graph Database?
A fair objection: if what we want is relationships, why not store them directly? Graph databases (Neo4j, JanusGraph, AWS Neptune) store nodes and edges as first-class entities, indexed for fast traversal.
The answer is that most workloads aren’t really graph workloads. They have relationships, but those relationships have a limited depth and a predictable shape. A typical OLTP query joins 2-5 tables. A typical analytical query aggregates across a handful of dimensions. These are cases where:
- The relational model’s table-and-join abstraction matches the data’s structure.
- Indexes on foreign-key columns make joins cheap.
- The query optimizer can find good plans.
Graph databases shine at deeply recursive queries: “find all people within six degrees of separation of this person who work in the same industry.” That’s a genuine graph traversal that SQL struggles to express elegantly (though modern SQL has recursive CTEs).
For the typical enterprise workload — customers, orders, products, categories, transactions — the relational model with joins is the right abstraction. It’s been competitively optimized for fifty years. Graph databases have their niche, but it’s smaller than their marketing suggests.
Denormalization: When Joins Hurt
At sufficient scale, joins can become the bottleneck. A query that joins a billion-row transactions table with a billion-row users table and a million-row products table can be slow no matter how good the optimizer is, simply because of data volume.
The pragmatic response is denormalization — storing redundant data to avoid the join at query time. Instead of:
SELECT t.amount, u.country
FROM transactions t
JOIN users u ON u.id = t.user_id;
You might pre-store country on the transactions table itself:
SELECT amount, country
FROM transactions; -- no join needed
This is faster but has costs. Updates to a user’s country now have to propagate to every transaction row. Storage is bigger. The data model is less clean — you’ve duplicated information.
Data warehouses (Snowflake, BigQuery, Redshift) frequently denormalize aggressively, because analytical queries benefit from avoiding joins and the read-to-write ratio is so high that the update cost is acceptable. OLTP databases denormalize sparingly, because their write-heavy workloads would have to update redundant copies constantly.
This is an active tradeoff, not a solved problem. The relational model gives you the normalized design as the starting point; denormalization is a deliberate, measurable departure from it.
What the JOIN Actually Proved
The Relational Model post calls the JOIN “the key insight” of the relational model. That framing is accurate but clinical. What the JOIN actually proved, in practice:
- Data outlives its original use cases. A schema designed around foreign keys can be queried for combinations the original designer didn’t anticipate. Hierarchical and network schemas couldn’t.
- Logical relationships beat physical ones. A pointer is a specific memory address. A foreign key is a shared value, which is stable across migrations, replication, partitioning, and disaster recovery.
- The database engine can improve. Joins can be implemented naively (nested loop) or sophisticatedly (hash, sort-merge), and improving the implementation doesn’t require rewriting any application code. Fifty years of database engineering has been adding smarter join algorithms to the same SQL surface.
The pre-relational databases weren’t slow. They were inflexible. The JOIN is what made the relational model flexible: any relationship, discovered on demand, with the algorithm chosen by the optimizer at runtime. The data doesn’t need to know how it’s going to be queried. That’s the property that made the relational model durable in a way nothing before it was.