MapReduce: Thinking in Parallel
Google's programming model for processing massive datasets across thousands of machines changed how we think about distributed computation.
The Google Problem
By 2003, Google was indexing billions of web pages. No single machine could process that much data. In 2004, Jeff Dean and Sanjay Ghemawat published the MapReduce paper, describing a simple abstraction: split your computation into a map phase (transform each record independently, emitting key-value pairs) and a reduce phase (aggregate all values that share a key).
Input → [Map] → Shuffle → [Reduce] → Output
Input:   "the cat sat on the mat"
Map:     the→1, cat→1, sat→1, on→1, the→1, mat→1
Shuffle: the→[1,1], cat→[1], sat→[1], on→[1], mat→[1]
Reduce:  the→2, cat→1, sat→1, on→1, mat→1
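The word-count trace above can be sketched as three small functions. This is an illustrative single-machine sketch, not Google's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`) are invented here for clarity:

```python
from collections import defaultdict

def map_phase(record):
    # Map: transform one record independently, emitting (word, 1) pairs.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values -- here, sum the counts.
    return {key: sum(values) for key, values in groups.items()}

pairs = map_phase("the cat sat on the mat")
counts = reduce_phase(shuffle(pairs))
# counts == {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```

The point of the split is that the map phase has no cross-record dependencies, so it can run on thousands of machines at once; only the shuffle requires moving data between them.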
The Impact
MapReduce spawned Hadoop, which spawned an entire ecosystem (Hive, Pig, Spark) and the “Big Data” era. More importantly, it taught a generation of engineers to think about computation as data pipelines — an idea that echoes in modern stream processing (Kafka, Flink) and even frontend state management (Redux’s reducers).