Blueprints · 1,715 words · 7 min read

TCP Congestion Control: Nagle to BBR

Forty years of trying to answer one question — how fast should you send packets when you can't see the network you're sending them through?

#TL;DR

TCP doesn’t know what it’s sending into. The sender can’t see the queues on intermediate routers, the capacity of the bottleneck link, or how many other flows are competing for the same bandwidth. It has to infer all of it from two signals: packets that got acknowledged and packets that didn’t. That inference has been the subject of continuous research since 1984, when Nagle wrote the first congestion-control paper, and it’s still an open problem in 2026. Every Linux kernel release changes something about how your connections share a network they can’t see.

#The Original Problem: Congestion Collapse

In October 1986, the internet’s throughput between LBL and UC Berkeley — a few hundred yards of wire — dropped from 32 kbit/s to 40 bits per second. Three orders of magnitude of collapse. The network hadn’t been disconnected. It was still forwarding packets. It just wasn’t forwarding anything useful.

The cause was feedback. Early TCP implementations would, on a detected loss, retransmit the missing packet — aggressively, without backing off. When a router started dropping packets because its queue was full, every sender whose packets got dropped responded by retransmitting more. The queue overflowed harder. More packets got dropped. Every retransmission made the problem worse.

This was congestion collapse, and Van Jacobson diagnosed it in a 1988 SIGCOMM paper that is probably the most important paper in the history of transport protocols. Jacobson’s fix — and the fix every TCP has shipped since — was to couple the sender’s transmission rate to the signals coming back from the network, and to slow down when loss is detected, not speed up.

#Nagle (1984): The First Congestion Paper

Four years earlier, John Nagle at Ford Aerospace had written RFC 896, which doesn’t use the word “congestion control” but introduced most of the ideas. Nagle was thinking about a specific, embarrassing case: a Telnet session over a long-haul link sending one-byte packets.

Each one-byte payload became a full TCP/IP packet — 40 bytes of headers, 1 byte of data, and a separate ACK packet on the way back. The network was spending 98% of its work on packaging. Under any kind of load, the overhead alone could congest the link.

Nagle’s algorithm said: don’t send small packets while there’s unacknowledged data in flight. Hold onto small writes briefly, combine them, and send them when either (a) you have enough bytes to be worth sending or (b) the outstanding ACK comes back. This introduced the idea that a sender’s decision to transmit should depend on the state of the network, not just on having bytes available.

def nagle_should_send(bytes_queued, max_segment_size, has_outstanding_unacked):
    if bytes_queued >= max_segment_size:
        return True
    if not has_outstanding_unacked:
        return True
    return False  # hold it, wait for the ACK or more data

Modern applications that care about latency (interactive SSH, small HTTP requests) still sometimes disable Nagle with TCP_NODELAY. The tradeoff is still the same one Nagle articulated in 1984.

#Jacobson (1988): Slow Start, AIMD, and the Congestion Window

Jacobson’s SIGCOMM paper introduced four ideas that are still the core of TCP congestion control.

A congestion window (cwnd). In addition to the receiver’s advertised window (how much the receiver can buffer), the sender maintains its own estimate of how much the network can handle. The effective send window is the minimum of the two. Cwnd starts small and grows based on feedback.

Slow Start. Cwnd starts at one packet. Every ACK doubles it. This sounds slow — “slow start” is a joke name; it’s actually exponential growth — but it lets the sender probe the network’s capacity quickly without committing to a high rate up front.

Additive Increase, Multiplicative Decrease (AIMD). Once cwnd crosses a threshold, the growth switches from exponential to linear: one extra packet per RTT. When loss is detected, cwnd is cut in half (or worse). The idea is that slow probing upward is safe; overshooting is expensive, so the response to overshoot is sharp.

cwnd

  │                                        ╱╲
  │                                       ╱  ╲
  │              ╱╲                     ╱      ╲
  │             ╱  ╲                  ╱          ╲
  │            ╱    ╲     ╲╱╲      ╱             
  │           ╱      ╲           ╱
  │          ╱        ╲       ╱
  │         ╱          ╲    ╱
  │        ╱            ╲ ╱
  │       ╱              ╳  ← loss detected: cwnd cut in half
  │      ╱ (slow start — exponential)
  │     ╱
  │    ╱
  │   ╱
  └────────────────────────────────────────► time

Fast Retransmit / Fast Recovery. Instead of waiting for a timeout to detect loss (which burns one full RTT of inactivity), the sender infers loss when it sees three duplicate ACKs — a receiver complaining about the same missing byte three times in a row. Retransmit immediately; don’t drop cwnd all the way to one.

These four ideas, in combination, transformed TCP from “congestion collapses the network” to “hundreds of competing flows settle into something like fair share.” The math isn’t obvious but holds up: AIMD is one of the few mechanisms that provably converges to fairness among flows sharing a bottleneck.

The 1988 paper also introduced RTT estimation — the algorithm for measuring a connection’s round-trip time so you know when a packet is late enough to assume it’s lost. This is itself a surprisingly subtle problem; Jacobson’s original EWMA-based estimator is still roughly what modern TCP uses.

#Reno, NewReno, SACK: The 1990s Iteration

Through the 1990s, variants on Jacobson’s original scheme got named after research labs and implementations:

  • Tahoe — the original Jacobson TCP. On any loss, drop cwnd to 1 and restart slow start. Conservative.
  • Reno — adds Fast Recovery. Keep cwnd higher after a single loss; assume the network isn’t actually collapsed.
  • NewReno — better behavior when multiple packets are lost in the same window.
  • SACK (Selective Acknowledgment, RFC 2018) — lets the receiver tell the sender exactly which packets are missing, rather than just “I’m still waiting for byte X.” This is what every modern TCP uses; without SACK, the sender has to guess.

By the late 1990s, TCP had pretty much solved the problem Jacobson set out to solve: preventing congestion collapse and sharing bandwidth fairly among flows on similar networks. What it hadn’t solved was: what happens when the networks aren’t similar?

As backbone links moved from megabits to gigabits and cross-continent latencies stayed the same, AIMD became embarrassingly conservative. A 10 Gbps, 100 ms cross-Pacific link requires a huge cwnd to be fully utilized. Reno’s linear +1 per RTT takes forever to grow to that size.

Concrete example: a Reno flow on a 10 Gbps link with 100 ms RTT, after a single loss, would take about 80 minutes to refill its window. That’s not a network using its capacity — that’s a network with a loss event at the start of a file transfer and a workstation that finishes the transfer over lunch.

CUBIC (Ha, Rhee, Xu, 2008) replaced the linear growth function with a cubic one. After a loss, cwnd grows slowly at first (to stay near the known-safe region), then accelerates, then flattens again near where the previous loss happened. It finds the high-bandwidth operating point quickly and holds there until the next loss signal.

CUBIC has been the Linux default since kernel 2.6.19 (late 2006). Most of the world’s data transfer runs on it.

#BBR (2016): Model the Network Instead of Reacting

Every algorithm above — Reno, NewReno, CUBIC — is loss-based. It assumes that if you’re not losing packets, you can send faster. That assumption breaks when there are large buffers in the path.

Bufferbloat is the phenomenon where a cheap home router has several hundred milliseconds of buffer. A loss-based TCP fills that buffer before it sees loss, and the RTT balloons from 20 ms to 500 ms while cwnd keeps climbing. The connection isn’t congested in the “about to drop packets” sense — it’s congested in the “delivering packets with absurd latency” sense — and loss-based algorithms can’t tell the difference.

BBR (Bottleneck Bandwidth and Round-trip propagation time), published by Google engineers Cardwell, Cheng, Gunn, Yeganeh, and Jacobson (yes, the same Jacobson) in 2016, took a completely different approach: build a model of the network and pace to it.

BBR measures:

  • Bottleneck bandwidth (BtlBw) — the maximum throughput observed over the last few seconds.
  • Round-trip propagation time (RTprop) — the minimum RTT observed (the physical path, not inflated by queueing).

It then sends at a rate such that cwnd = BtlBw × RTprop. That’s the bandwidth-delay product — the exact amount of in-flight data to keep the bottleneck full without building a queue. BBR periodically probes upward to re-measure BtlBw and downward to re-measure RTprop, but spends most of its time cruising at the right rate.

BBR dramatically outperformed CUBIC on bufferbloated paths. It also had a fairness problem on early versions (BBRv1 was aggressive against CUBIC flows sharing a link) that BBRv2 and BBRv3 have been iteratively fixing. As of 2026, BBR is widely deployed on Google’s infrastructure, enabled as a non-default option in Linux, and remains an area of active research.

#Why This Is Still Being Tuned

You might reasonably ask: if Jacobson solved congestion collapse in 1988, why are people still publishing papers on this?

  • New link types keep breaking assumptions. Satellite (Starlink), mobile 5G, data center Ethernet, lossy Wi-Fi — each has different latency, loss, and queue characteristics, and each exposes corner cases the old algorithms weren’t designed for.
  • Data center networks care about tail latency, not throughput. DCTCP, DCQCN, and similar algorithms assume you can signal congestion with ECN marks (single-bit notifications from routers) rather than waiting for drops. This is a different optimization target.
  • Internet TCP has to play fair. A new algorithm that’s great for your own traffic but starves other flows sharing a link won’t be accepted. Most of the difficulty of modern congestion control is proving fairness properties, not raw throughput.
  • QUIC changes the landscape. Because QUIC runs in user space, congestion-control algorithms can be deployed by upgrading a binary rather than shipping a kernel. The pace of iteration has accelerated dramatically.

The TCP/IP post mentions congestion control in a single paragraph as something that’s “still being tuned today.” It is — and the reason is fundamental. The internet is a moving target: new applications, new link types, new scale regimes, new fairness requirements. There’s no final answer to “how fast should I send,” because the question keeps getting redefined.

What’s held up is the structure Jacobson established in 1988: use feedback from the network to decide the rate, be cautious about ramping up, cut back sharply when something goes wrong, and assume that the only thing you know about the network is what its ACKs and losses tell you. Forty years on, every new algorithm is still answering the same question — what should the next packet do? — with slightly better telemetry and slightly smarter math.