友田 陽大

A complete explanation of how TCP works: understanding the 3-way handshake, state transitions, retransmission, and congestion control via RFC 9293

An implementation guide that explains how TCP builds reliability, faithful to IETF primary sources (RFC 9293, 5681, 6298). It pulls together the 3-way handshake, the 11-state state machine, sequence numbers and ACK / retransmission (RTO, fast retransmit), flow control (window), and congestion control (slow start, CUBIC, BBR) — with diagrams, code, and observation commands — into a form usable for production debugging.

Published
Reading time
12 min read
Author
友田 陽大

"TCP is reliable" is a common explanation, but few people can explain how that reliability is actually built. During a production incident, knowing the "how" determines how quickly you can isolate the fault. Why does connection establishment take time? Why does throughput collapse when packets are lost? What is TIME-WAIT for? Why is a pile-up of CLOSE-WAIT "your own bug"?

This article explains the core of TCP (the 3-way handshake, the 11-state state machine, retransmission, flow control, and congestion control), staying faithful to IETF primary sources and presenting it in a form usable in production, with diagrams and observation commands. It fills in the details behind the one-line statement in the TCP/IP big-picture article: "TCP provides reliability, ordering, flow control, and congestion control."

Rules for this article: the specifications are based on TCP = RFC 9293 (August 2022, obsoletes RFC 793 et al. and updates RFC 1122), congestion control = RFC 5681, the retransmission timer (RTO) = RFC 6298, SACK = RFC 2018, and window scale/timestamps = RFC 7323. Concrete congestion-control algorithms such as CUBIC and BBR are OS-implementation-dependent. RFCs are revised, so confirm the latest version at rfc-editor.org.


1. The TCP segment header — grasp the "parts" of reliability first

Before the mechanism, let's confirm that the "tools" TCP uses to build reliability are all in the header. These are the main fields RFC 9293 defines (minimum 20 bytes).

Field | Size | Role
------|------|-----
Source port / Destination port | 16 bits each | process multiplexing (part of the 5-tuple)
Sequence number | 32 bits | the serial number of this segment's first byte. The core of ordering and loss detection
Acknowledgment number (ACK number) | 32 bits | "the next byte number I want" = received up to here (cumulative ACK)
Data offset | 4 bits | header length (needed because options make it variable)
Control flags | 1 bit each | CWR, ECE, URG, ACK, PSH, RST, SYN, FIN
Window | 16 bits | the number of bytes the receiver can accept = flow control
Checksum | 16 bits | integrity check of header + data + pseudo-header
Urgent pointer | 16 bits | only when URG is valid

Once you understand the control flags, every state transition that follows becomes readable.

  • SYN: open a connection. Synchronize the sequence number.
  • ACK: the acknowledgment number is valid. Set on every segment after establishment.
  • FIN: sending is complete. Signals "no more data to send" and begins an orderly close.
  • RST: abort the connection immediately. This is what surfaces in applications as ECONNRESET.
  • PSH: asks the receiver to hand the data to the app right away instead of holding it in a buffer.
  • URG / urgent pointer: urgent data. Almost unused in modern times.
  • ECE / CWR: an extension for conveying congestion without dropping packets via Explicit Congestion Notification (ECN, RFC 3168).
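
To make the field layout concrete, here is a minimal sketch that decodes the fixed 20-byte part of a header. The names `parseTcpHeader` and `TcpHeader` and the sample bytes are illustrative assumptions, not part of any library; options and checksum verification are deliberately left out.

```typescript
interface TcpHeader {
  sourcePort: number;
  destinationPort: number;
  sequenceNumber: number;
  ackNumber: number;
  headerLengthBytes: number;
  flags: string[];
  window: number;
  checksum: number;
  urgentPointer: number;
}

// Bit masks of the eight control flags in byte 13 of the header.
const FLAG_BITS: ReadonlyArray<[string, number]> = [
  ["CWR", 0x80], ["ECE", 0x40], ["URG", 0x20], ["ACK", 0x10],
  ["PSH", 0x08], ["RST", 0x04], ["SYN", 0x02], ["FIN", 0x01],
];

function parseTcpHeader(bytes: Uint8Array): TcpHeader {
  if (bytes.length < 20) throw new Error("a TCP header is at least 20 bytes");
  // DataView reads multi-byte fields in network byte order (big-endian) by default.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const flagByte = view.getUint8(13);
  return {
    sourcePort: view.getUint16(0),
    destinationPort: view.getUint16(2),
    sequenceNumber: view.getUint32(4),
    ackNumber: view.getUint32(8),
    // Data offset is the upper 4 bits of byte 12, counted in 32-bit words.
    headerLengthBytes: (view.getUint8(12) >> 4) * 4,
    flags: FLAG_BITS.filter(([, mask]) => (flagByte & mask) !== 0).map(([name]) => name),
    window: view.getUint16(14),
    checksum: view.getUint16(16),
    urgentPointer: view.getUint16(18),
  };
}

// Hand-built sample: a SYN/ACK from port 443 to port 51000,
// seq=1000, ack=2001, window=65535 (checksum left as 0 for brevity).
const sampleSynAck = new Uint8Array([
  0x01, 0xbb, 0xc7, 0x38,
  0x00, 0x00, 0x03, 0xe8,
  0x00, 0x00, 0x07, 0xd1,
  0x50, 0x12, 0xff, 0xff,
  0x00, 0x00, 0x00, 0x00,
]);

console.log(parseTcpHeader(sampleSynAck));
```

The flags decode as ACK and SYN together, which is exactly the second segment of the handshake described in the next section.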

2. The 3-way handshake — why are "3 times" necessary

TCP establishes a connection before any data is exchanged. RFC 9293 describes this as the procedure in which both sides synchronize their initial sequence numbers (ISNs) and confirm each other's.

Client                                      Server
   │                                          │
   │   ① SYN  seq=x                           │   "want to open. my start number is x"
   │ ───────────────────────────────────────► │
   │                                          │
   │   ② SYN, ACK  seq=y, ack=x+1             │   "OK, got your SYN (next I expect x+1). my start number is y"
   │ ◄─────────────────────────────────────── │
   │                                          │
   │   ③ ACK  ack=y+1                         │   "OK, got your SYN (next I expect y+1). established"
   │ ───────────────────────────────────────► │
   │                                          │
  ESTABLISHED                             ESTABLISHED

Why two segments are not enough: reliable communication needs agreement in both directions. ① carries the client→server ISN (x), ② carries the server→client ISN (y), and each ISN must be acknowledged by the other side. Because ② combines the server's SYN with the ACK of the client's SYN, four logical steps compress into three segments. That is the reason for "3-way."

Why the ISN is randomized: a predictable ISN enables the "sequence-number prediction attack," in which a third party injects forged segments. RFC 6528 specifies generating the ISN from a clock plus a keyed hash of the connection's addresses and ports, which makes it hard to predict. It is a good example of security woven into the procedure itself.
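
The RFC 6528 scheme has the shape ISN = M + F(local IP, local port, remote IP, remote port, secret), where M is a 4-microsecond clock and F is a keyed hash. Below is a minimal sketch of that shape. The function name `generateIsn`, the `bootSecret` variable, and the choice of HMAC-SHA-256 for F are assumptions made for illustration; a real stack does this inside the kernel.

```typescript
import { createHmac, randomBytes } from "node:crypto";

const MOD_2_32 = 2 ** 32;

// Hypothetical per-boot secret. Without it, an observer could compute F themselves.
const bootSecret = randomBytes(32);

function generateIsn(
  localIp: string,
  localPort: number,
  remoteIp: string,
  remotePort: number,
  nowMicros: number,
  secret: Uint8Array = bootSecret,
): number {
  // F: keyed hash of the 4-tuple, truncated to 32 bits.
  const f = createHmac("sha256", secret)
    .update(`${localIp}|${localPort}|${remoteIp}|${remotePort}`)
    .digest()
    .readUInt32BE(0);
  // M: a counter that advances by one every 4 microseconds.
  const m = Math.floor(nowMicros / 4) % MOD_2_32;
  return (m + f) % MOD_2_32;
}

console.log(generateIsn("10.0.0.1", 51000, "10.0.0.2", 443, Date.now() * 1000));
```

For a fixed 4-tuple the ISN still advances with the clock, so successive connections do not reuse sequence space, while an outsider who lacks the secret cannot predict the offset for another 4-tuple.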

Design implication: the handshake always costs at least one round trip (1 RTT) before any data flows. With HTTPS, the TLS handshake comes on top of that. This is why opening a new connection for every request is painfully slow on long-distance, high-RTT paths, and why Keep-Alive and connection pools pay off. TCP Fast Open (RFC 7413) is an extension that removes this initial RTT, but it applies only in limited cases.
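
The fixed cost is easy to put into numbers. The sketch below assumes 1 RTT for the TCP handshake, 1 RTT for a full TLS 1.3 handshake, and 2 RTT for a full TLS 1.2 handshake; `setupLatencyMs` is a hypothetical helper, and the 150 ms RTT is just an example value.

```typescript
type Tls = "none" | "tls1.3" | "tls1.2";

// Round trips added by a full TLS handshake (assumed values, no session resumption).
const TLS_ROUND_TRIPS: Record<Tls, number> = { none: 0, "tls1.3": 1, "tls1.2": 2 };

/** Time spent on handshakes before the first request byte can leave. */
function setupLatencyMs(rttMs: number, tls: Tls, reusedConnection = false): number {
  if (reusedConnection) return 0; // pooled or Keep-Alive connection: no handshake at all
  return rttMs * (1 + TLS_ROUND_TRIPS[tls]);
}

console.log(setupLatencyMs(150, "tls1.3"));       // new connection on a 150 ms path
console.log(setupLatencyMs(150, "tls1.3", true)); // same path, pooled connection
```

On a 150 ms path a fresh TLS 1.3 connection spends 300 ms before the request even starts, while a pooled connection spends none, which is the whole argument for reuse.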


3. The 11-state state machine — the lifecycle of a TCP connection

RFC 9293 defines a TCP connection with 11 states: LISTEN, SYN-SENT, SYN-RECEIVED, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT, plus the conceptual CLOSED. This is the source of those state names you see in ss and netstat.

3.1 Establishment phase

            ┌──────────┐
            │  CLOSED  │
            └────┬─────┘
       passive   │   active open (connect)
       open      │   send SYN
   (listen)      ▼
   ┌────────┐  ┌──────────┐
   │ LISTEN │  │ SYN-SENT │
   └───┬────┘  └────┬─────┘
  recv SYN        recv SYN/ACK
  send SYN/ACK    send ACK
       ▼               │
 ┌───────────────┐     │
 │ SYN-RECEIVED  │     │
 └──────┬────────┘     │
   recv ACK            │
        └──────┬───────┘
               ▼
        ┌─────────────┐
        │ ESTABLISHED │  ← data send/recv happens here
        └─────────────┘

3.2 Release phase (4-way, and then TIME-WAIT)

Because each direction of a connection closes independently, teardown is in principle a four-segment exchange (a FIN and an ACK in each direction).

active-close side (the one that called close)   passive-close side
   ESTABLISHED                              ESTABLISHED
       │  ① FIN ─────────────────────────────►│
   FIN-WAIT-1                                  │ recv FIN
       │◄───────────────────── ② ACK ─────────│ CLOSE-WAIT
   FIN-WAIT-2                                  │ (stays until the app calls close)
       │◄───────────────────── ③ FIN ─────────│ LAST-ACK
       │  ④ ACK ─────────────────────────────►│
   TIME-WAIT                                CLOSED
   (wait 2×MSL)
       ▼
   CLOSED

Three practical lessons follow from this.

  • TIME-WAIT (the active-close side waits 2×MSL): a safety mechanism that keeps late-arriving segments of an old connection from polluting a new connection on the same 5-tuple. It is correct behavior, not an anomaly. However, opening many short-lived connections lets TIME-WAIT entries pile up and can lead to port exhaustion (→ big-picture article §5.1).
  • CLOSE-WAIT piling up is almost always your own bug: the ACK for the peer's FIN is returned automatically, but the connection only advances to LAST-ACK when the app explicitly calls close(). If the app never calls it, the socket sits in CLOSE-WAIT indefinitely and leaks file descriptors (FDs).
  • Interruption by RST: unlike the orderly 4-way close, an RST tears the connection down immediately. The receiving side sees ECONNRESET. Typical causes include a load balancer (LB) dropping an idle connection, crashes, and queue overflows.
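
The two diagrams above can be written down as a transition table. This is a simplified sketch: the event names and the `nextState` / `walk` helpers are made up for illustration, and RST handling, timeouts, and simultaneous open are omitted.

```typescript
type TcpState =
  | "CLOSED" | "LISTEN" | "SYN-SENT" | "SYN-RECEIVED" | "ESTABLISHED"
  | "FIN-WAIT-1" | "FIN-WAIT-2" | "CLOSE-WAIT" | "CLOSING" | "LAST-ACK" | "TIME-WAIT";

type TcpEvent =
  | "passive-open" | "active-open" | "recv-syn" | "recv-syn-ack"
  | "recv-ack" | "close" | "recv-fin" | "timeout-2msl";

const TRANSITIONS: Partial<Record<TcpState, Partial<Record<TcpEvent, TcpState>>>> = {
  CLOSED: { "passive-open": "LISTEN", "active-open": "SYN-SENT" },
  LISTEN: { "recv-syn": "SYN-RECEIVED" },
  "SYN-SENT": { "recv-syn-ack": "ESTABLISHED" },
  "SYN-RECEIVED": { "recv-ack": "ESTABLISHED" },
  ESTABLISHED: { close: "FIN-WAIT-1", "recv-fin": "CLOSE-WAIT" },
  "FIN-WAIT-1": { "recv-ack": "FIN-WAIT-2", "recv-fin": "CLOSING" },
  "FIN-WAIT-2": { "recv-fin": "TIME-WAIT" },
  CLOSING: { "recv-ack": "TIME-WAIT" },
  "CLOSE-WAIT": { close: "LAST-ACK" }, // only the application's close() leaves this state
  "LAST-ACK": { "recv-ack": "CLOSED" },
  "TIME-WAIT": { "timeout-2msl": "CLOSED" },
};

function nextState(state: TcpState, event: TcpEvent): TcpState {
  const target = TRANSITIONS[state]?.[event];
  if (target === undefined) throw new Error(`no transition from ${state} on ${event}`);
  return target;
}

/** Returns every state visited, starting with the initial one. */
function walk(start: TcpState, events: TcpEvent[]): TcpState[] {
  const visited: TcpState[] = [start];
  for (const event of events) {
    visited.push(nextState(visited[visited.length - 1], event));
  }
  return visited;
}

// Active close: ends in TIME-WAIT before CLOSED.
console.log(walk("ESTABLISHED", ["close", "recv-ack", "recv-fin", "timeout-2msl"]).join(" -> "));
// Passive close without the app calling close(): stuck in CLOSE-WAIT.
console.log(walk("ESTABLISHED", ["recv-fin"]).join(" -> "));
```

The table makes the CLOSE-WAIT lesson visible: the only outgoing edge from that state is `close`, an event that only the application can produce.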

4. How reliability is built ①: sequence numbers, cumulative ACK, retransmission

4.1 The basics of cumulative ACK and retransmission

TCP assigns a sequence number to the bytes it sends, and the receiver returns "the next byte number I want" as an ACK number (cumulative ACK = meaning everything below it has been received). The sender retransmits data whose ACK doesn't return. This is the root of reliability.

There are two kinds of retransmission trigger.

4.2 RTO (retransmission timeout) — give up by time

If an ACK doesn't return within a certain time, retransmit. That "certain time" is the RTO, and RFC 6298 specifies computing it dynamically from the measured RTT. It's not a fixed value.

SRTT     = smoothed round-trip time (exponentially weighted moving average of RTT)
RTTVAR   = round-trip time variation (smoothed mean deviation of RTT, not a statistical variance)
RTO      = SRTT + max(G, 4 × RTTVAR)      (G is the clock granularity)

The key point is that the RTO is deliberately conservative (on the long side) and doubles on each successive retransmission (exponential backoff). So even when "just one packet" is lost, a retransmission triggered by the RTO can add a delay large enough to be noticed. Fast retransmit, described next, is what avoids this.
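
The update rules behind that formula are short enough to sketch. The constants α = 1/8, β = 1/4, and K = 4, the 1-second lower bound, and the 60-second upper bound follow RFC 6298; the function and type names (`firstSample`, `nextSample`, `backoff`, `RtoState`) are illustrative assumptions.

```typescript
interface RtoState { srttMs: number; rttvarMs: number; rtoMs: number; }
interface RtoOptions { granularityMs: number; minRtoMs: number; maxRtoMs: number; }

const ALPHA = 1 / 8; // weight of a new sample in SRTT
const BETA = 1 / 4;  // weight of a new sample in RTTVAR
const K = 4;

const RFC_DEFAULTS: RtoOptions = { granularityMs: 1, minRtoMs: 1000, maxRtoMs: 60_000 };

function computeRto(srttMs: number, rttvarMs: number, o: RtoOptions): number {
  const raw = srttMs + Math.max(o.granularityMs, K * rttvarMs);
  return Math.min(Math.max(raw, o.minRtoMs), o.maxRtoMs);
}

/** First RTT measurement: SRTT = R, RTTVAR = R / 2. */
function firstSample(rttMs: number, o: RtoOptions = RFC_DEFAULTS): RtoState {
  const srttMs = rttMs;
  const rttvarMs = rttMs / 2;
  return { srttMs, rttvarMs, rtoMs: computeRto(srttMs, rttvarMs, o) };
}

/** Later measurements: update RTTVAR with the old SRTT first, then SRTT. */
function nextSample(prev: RtoState, rttMs: number, o: RtoOptions = RFC_DEFAULTS): RtoState {
  const rttvarMs = (1 - BETA) * prev.rttvarMs + BETA * Math.abs(prev.srttMs - rttMs);
  const srttMs = (1 - ALPHA) * prev.srttMs + ALPHA * rttMs;
  return { srttMs, rttvarMs, rtoMs: computeRto(srttMs, rttvarMs, o) };
}

/** Timer expired: double the RTO, up to the upper bound. */
function backoff(prev: RtoState, o: RtoOptions = RFC_DEFAULTS): RtoState {
  return { ...prev, rtoMs: Math.min(prev.rtoMs * 2, o.maxRtoMs) };
}

// With the lower bound removed, the raw arithmetic is easier to follow.
const noFloor: RtoOptions = { granularityMs: 1, minRtoMs: 0, maxRtoMs: 60_000 };
const afterFirst = firstSample(100, noFloor);            // SRTT 100, RTTVAR 50
const afterJitter = nextSample(afterFirst, 200, noFloor); // one slow sample widens the RTO
console.log(afterFirst.rtoMs, afterJitter.rtoMs, backoff(afterJitter, noFloor).rtoMs);
```

A single slow sample raises RTTVAR much more than SRTT, so jittery paths end up with a noticeably longer RTO than their average RTT would suggest.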

4.3 Fast Retransmit — send without waiting on 3 duplicate ACKs

When segments arrive with a gap in the sequence (e.g., 1, 2, _, 4, 5), the receiver keeps returning the same ACK, saying "I still want 3." Once the sender has received 3 duplicate ACKs, it retransmits the missing segment immediately instead of waiting for the RTO to expire. Recovery is faster by exactly the time it would otherwise have spent waiting for the timer.
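
On the sender side, this amounts to counting repeated ACK numbers. A minimal sketch follows, with hypothetical names (`AckTracker`, `onAck`). It ignores sequence-number wraparound and the extra conditions RFC 5681 places on what counts as a duplicate ACK.

```typescript
interface AckTracker { lastAck: number; duplicates: number; }

const DUP_ACK_THRESHOLD = 3;

/**
 * Feed one incoming ACK number. `retransmitFrom` is the sequence number to
 * resend when the third duplicate arrives, otherwise null.
 */
function onAck(
  tracker: AckTracker,
  ack: number,
): { tracker: AckTracker; retransmitFrom: number | null } {
  if (ack > tracker.lastAck) {
    // New data acknowledged: the duplicate counter starts over.
    return { tracker: { lastAck: ack, duplicates: 0 }, retransmitFrom: null };
  }
  if (ack < tracker.lastAck) {
    return { tracker, retransmitFrom: null }; // stale ACK, ignore
  }
  const duplicates = tracker.duplicates + 1;
  return {
    tracker: { lastAck: ack, duplicates },
    retransmitFrom: duplicates === DUP_ACK_THRESHOLD ? ack : null,
  };
}

// 1000-byte segments; the one starting at byte 2000 is lost.
// The receiver answers the three segments after the gap with ack=2000 each time.
let ackTracker: AckTracker = { lastAck: 0, duplicates: 0 };
const decisions: Array<number | null> = [];
for (const ack of [1000, 2000, 2000, 2000, 2000]) {
  const result = onAck(ackTracker, ack);
  ackTracker = result.tracker;
  decisions.push(result.retransmitFrom);
}
console.log(decisions);
```

The first ack=2000 is a normal acknowledgment; only the three repeats after it count as duplicates, so the retransmission fires on the fifth ACK in this trace.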

4.4 SACK (Selective Acknowledgment) — convey exactly "what was missing"

With cumulative ACK alone, the receiver can say "3 is missing" but cannot say "4 and 5 arrived," so the sender risks wastefully retransmitting 4 and 5 as well. SACK (RFC 2018) appends the "islands" of received data to the ACK, enabling pinpoint retransmission of only the missing holes. It's enabled by default in modern stacks.

# Is SACK enabled? (Linux)
sysctl net.ipv4.tcp_sack          # 1 means enabled
# Inspect per-connection retransmissions and RTT
ss -tin                            # shows rtt:, retrans:, cwnd:, and more
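
How a sender turns SACK information into "holes" can be sketched in a few lines. Following RFC 2018, each block is a half-open range [start, end) of bytes the receiver already holds. The `sackHoles` function and the `ByteRange` type are hypothetical names, and wraparound is ignored.

```typescript
interface ByteRange { start: number; end: number; } // [start, end)

/** Gaps between the cumulative ACK and the SACKed islands: the only data worth resending. */
function sackHoles(cumulativeAck: number, sackBlocks: ByteRange[]): ByteRange[] {
  const sorted = [...sackBlocks].sort((a, b) => a.start - b.start);
  const holes: ByteRange[] = [];
  let cursor = cumulativeAck;
  for (const block of sorted) {
    if (block.start > cursor) holes.push({ start: cursor, end: block.start });
    cursor = Math.max(cursor, block.end);
  }
  return holes;
}

// The "1, 2, _, 4, 5" case with 1000-byte segments starting at byte 0:
// everything below 2000 is acknowledged and bytes 3000..4999 arrived as an island.
console.log(sackHoles(2000, [{ start: 3000, end: 5000 }]));
```

Only the range 2000 to 3000 comes back as a hole, so segments 4 and 5 are never resent.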

5. How reliability is built ②: flow control and congestion control are "different things"

Conflate these and you'll misjudge the cause of a performance problem. Flow control protects the receiver; congestion control protects the network — the purposes differ.

5.1 Flow control (receive window) — "fast sender vs. slow receiver"

The receiver, via the header's Window field, notifies "the number of bytes it can accept right now" each time. The sender can't send beyond this. If the receiving app doesn't read and the buffer fills, the window becomes 0 and sending stops (zero window). When the receiver recovers, it resumes with a window update.

The 16-bit window field can express at most 65,535 bytes (about 64 KB), which is not enough on a fast, high-latency link. The window scale option (RFC 7323) extends it by exchanging a scaling factor during the handshake.
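
The arithmetic behind "not enough" is the bandwidth-delay product: a sender can have at most one window of data in flight per round trip. The helper names below are hypothetical, and the 1 Gbit/s link with a 100 ms RTT is an example; the maximum shift of 14 comes from RFC 7323.

```typescript
const MAX_UNSCALED_WINDOW = 65_535; // largest value of the 16-bit Window field
const MAX_WINDOW_SHIFT = 14;        // largest scale factor allowed by RFC 7323

/** Upper bound on throughput when one window is sent per round trip. */
function maxThroughputBps(windowBytes: number, rttMs: number): number {
  return (windowBytes * 8 * 1000) / rttMs;
}

/** Bandwidth-delay product: bytes that must be in flight to keep the link full. */
function bdpBytes(bandwidthBps: number, rttMs: number): number {
  return (bandwidthBps * rttMs) / 1000 / 8;
}

/** Smallest window-scale shift whose scaled window covers the target size. */
function requiredWindowShift(targetWindowBytes: number): number {
  let shift = 0;
  while (shift < MAX_WINDOW_SHIFT && MAX_UNSCALED_WINDOW * 2 ** shift < targetWindowBytes) {
    shift++;
  }
  return shift;
}

const rttMs = 100;
console.log(maxThroughputBps(MAX_UNSCALED_WINDOW, rttMs)); // bits per second without scaling
const neededWindow = bdpBytes(1_000_000_000, rttMs);       // 1 Gbit/s link
console.log(neededWindow, requiredWindowShift(neededWindow));
```

Without scaling, a 100 ms path tops out at roughly 5.2 Mbit/s regardless of link speed; filling 1 Gbit/s at that RTT needs about 12.5 MB in flight, which takes a shift of 8.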

5.2 Congestion control — protect the whole from "network congestion"

Separately from the receive window, the sender keeps an internal estimate called the congestion window (cwnd): "the amount the network seems able to bear right now." What can actually be sent is min(receive window, cwnd). The basic scheme in RFC 5681 is as follows.

① Slow start: cwnd roughly doubles every RTT (exponential growth) until it reaches ssthresh.
② Congestion avoidance: once cwnd exceeds ssthresh, it grows by about 1 MSS per RTT (linear growth). Increase cautiously.
③ On loss detection:
   - duplicate ACKs (fast retransmit) → set ssthresh to about half the data in flight and shrink cwnd to match (fast recovery).
   - RTO expiry → set ssthresh the same way, reset cwnd to 1 MSS, and restart from slow start (heavy penalty).

This AIMD (additive increase, multiplicative decrease) behavior is what lies behind "throughput collapses on packet loss." It is why loss is the natural enemy of performance: on a long-distance fat pipe, even a small loss rate keeps cwnd from growing enough to fill the available bandwidth.
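
A toy simulation shows the sawtooth. It is a deliberately coarse sketch: cwnd is counted in whole MSS units and updated once per RTT, the initial window is assumed to be 1 MSS, and fast recovery is collapsed into "cwnd = ssthresh." `simulateCwnd` and its parameters are hypothetical.

```typescript
type Loss = "dupack" | "rto";

/** Returns cwnd (in MSS) at the start of each round trip. */
function simulateCwnd(
  rounds: number,
  initialSsthresh: number,
  losses: Record<number, Loss> = {},
): number[] {
  let cwnd = 1;
  let ssthresh = initialSsthresh;
  const trace: number[] = [];
  for (let rtt = 0; rtt < rounds; rtt++) {
    trace.push(cwnd);
    const loss = losses[rtt];
    if (loss !== undefined) {
      // Multiplicative decrease: ssthresh becomes half of what was in flight.
      ssthresh = Math.max(Math.floor(cwnd / 2), 2);
      cwnd = loss === "rto" ? 1 : ssthresh;
    } else if (cwnd < ssthresh) {
      cwnd = Math.min(cwnd * 2, ssthresh); // slow start: double per RTT
    } else {
      cwnd += 1; // congestion avoidance: +1 MSS per RTT
    }
  }
  return trace;
}

// Loss detected by duplicate ACKs in round 7, then an RTO in round 10.
console.log(simulateCwnd(12, 16, { 7: "dupack", 10: "rto" }).join(" "));
// 1 2 4 8 16 17 18 19 9 10 11 1
```

After the duplicate-ACK loss the window halves and resumes linear growth; after the RTO it falls all the way back to 1. At +1 MSS per RTT, a path with a large RTT needs many round trips to regain what a single loss took away.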

5.3 The concrete algorithm is OS-implementation-dependent — CUBIC and BBR

RFC 5681 is the framework, and the actual increase/decrease curve is decided by the algorithm. Two you meet in production:

  • CUBIC: Linux's default. It grows cwnd along a cubic function and recovers quickly even on high-bandwidth, high-latency paths. It refines the traditional loss-based approach, which treats packet loss as the congestion signal.
  • BBR (Google): estimates the optimal sending rate from the measured bandwidth and RTT, not from loss. It's strong against bufferbloat (delay from oversized buffers) and can be advantageous on lossy wireless / long-distance links.

# Available / current congestion-control algorithms (Linux)
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control     # usually cubic by default

Design implication: when "throughput doesn't grow even though I added servers," the cause is often not the app but the interaction of congestion control, loss, and RTT. Check retrans and rtt in the output of ss -tin: if retransmissions are high, suspect the path (wireless, a tunnel, or an overloaded middlebox); if the RTT is large, look at how often connections are re-established (pooling) and at region placement.


6. See it in code: half-open connections and Keep-Alive

Here's an example where knowledge of the state machine maps directly onto a code bug. When one end disappears because of a crash or a NAT/LB idle disconnect, the other end cannot tell that the connection is dead until it tries to send something (TCP exchanges nothing while idle). This is a half-open connection. If it goes unnoticed, you pull the dead connection from the pool and hit ECONNRESET.

import net from "node:net";

/** TCP client setup that actively detects dead peers and closes idle connections safely. */
export function createResilientSocket(host: string, port: number): net.Socket {
  const socket = net.createConnection({ host, port });

  // ① TCP Keep-Alive: probe idle connections so that a vanished peer is detected
  //    (the main defense against half-open connections; the OS periodically sends empty probe segments)
  socket.setKeepAlive(true, 30_000); // start probing after 30 s of idle time

  // ② Application-level idle timeout: close the socket ourselves if no data flows for a while
  socket.setTimeout(60_000, () => {
    socket.destroy(new Error("idle timeout: no data for 60s"));
  });

  // ③ Surface RSTs and other errors instead of swallowing them (makes ECONNRESET visible)
  socket.on("error", (err) => {
    console.error(`[tcp] ${host}:${port} ${(err as NodeJS.ErrnoException).code ?? ""} ${err.message}`);
  });

  return socket;
}

Keep-Alive (the OS's TCP probe) and the app's timeout are different things. The former checks whether the connection itself is still alive; the latter checks whether the peer is responding at the application level. In production, it is safest to set both shorter than the upstream LB's idle timeout (if the LB cuts the connection first, you end up holding a dead connection).
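
That ordering rule is easy to check mechanically at startup. The sketch below is one possible guard; `TimeoutConfig`, `findTimeoutProblems`, and the example values are assumptions for illustration, and the LB's real idle timeout has to come from your own infrastructure settings.

```typescript
interface TimeoutConfig {
  lbIdleTimeoutMs: number;  // idle timeout of the upstream load balancer
  keepAliveDelayMs: number; // value passed to socket.setKeepAlive(true, ...)
  appIdleTimeoutMs: number; // value passed to socket.setTimeout(...)
}

/** Returns one message per setting that is not strictly shorter than the LB idle timeout. */
function findTimeoutProblems(config: TimeoutConfig): string[] {
  const problems: string[] = [];
  if (config.keepAliveDelayMs >= config.lbIdleTimeoutMs) {
    problems.push("keepAliveDelayMs must be shorter than lbIdleTimeoutMs");
  }
  if (config.appIdleTimeoutMs >= config.lbIdleTimeoutMs) {
    problems.push("appIdleTimeoutMs must be shorter than lbIdleTimeoutMs");
  }
  return problems;
}

// The 30 s / 60 s values from the code above, behind a hypothetical LB that cuts at 60 s:
console.log(findTimeoutProblems({
  lbIdleTimeoutMs: 60_000,
  keepAliveDelayMs: 30_000,
  appIdleTimeoutMs: 60_000,
}));
```

Here the keep-alive delay passes, but the application timeout ties with the LB, so the LB may win the race and leave a dead connection in the pool.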


7. TCP's limits — that's why QUIC/HTTP3 was born

Finally, two structural weaknesses of TCP. Knowing them makes the next technology choice easier to anticipate.

  1. Head-of-line blocking (HoLB): because TCP is "a single in-order byte stream," when one segment is lost, the (unrelated) data behind it also can't be handed to the app and is kept waiting. Even if HTTP/2 multiplexes over one TCP connection, all streams stop on a TCP-layer loss for this reason.
  2. The fixed cost of the connection-establishment RTT: 3-way + TLS means multiple RTTs every time. The more latency, the more it bites.
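
Head-of-line blocking falls straight out of in-order delivery, as a small receive-buffer sketch shows. `deliverInOrder` and the `Segment` type are hypothetical, and sequence numbers here simply count characters of the payload.

```typescript
interface Segment { seq: number; data: string; }

/**
 * For each arriving segment, returns what can be handed to the application
 * at that moment. Data past a gap is buffered, not delivered.
 */
function deliverInOrder(arrivals: Segment[], firstSeq = 0): string[][] {
  const buffered = new Map<number, string>();
  let nextSeq = firstSeq;
  return arrivals.map((segment) => {
    buffered.set(segment.seq, segment.data);
    const delivered: string[] = [];
    let ready = buffered.get(nextSeq);
    while (ready !== undefined) {
      delivered.push(ready);
      buffered.delete(nextSeq);
      nextSeq += ready.length;
      ready = buffered.get(nextSeq);
    }
    return delivered;
  });
}

// Segment B (seq 4) is lost and arrives last, as a retransmission.
const deliveries = deliverInOrder([
  { seq: 0, data: "AAAA" },
  { seq: 8, data: "CCCC" },
  { seq: 12, data: "DDDD" },
  { seq: 4, data: "BBBB" },
]);
console.log(deliveries);
// [ [ 'AAAA' ], [], [], [ 'BBBB', 'CCCC', 'DDDD' ] ]
```

C and D were already in the receiver's memory, yet the application sees nothing until B shows up. If C and D belonged to a different HTTP/2 stream than B, they would still wait, because TCP has no notion of streams.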

QUIC (RFC 9000, 2021) re-implements per-stream reliability, encryption, and congestion control on top of UDP. By recovering each stream independently it removes HoLB between streams, and it shortens establishment to 0–1 RTT. HTTP/3 (RFC 9114, 2022) is built on QUIC. TCP was not thrown away; instead, the entire transport layer was rebuilt to get around TCP's constraints. This is the biggest concrete example of the "layer swap" touched on in the TCP/IP big-picture article.

The decision axes for using TCP vs. UDP (and QUIC) are concretized in this cluster's 'The difference between TCP and UDP and how to choose'.


8. Summary

  • Establishment: the 3-way SYN→SYN/ACK→ACK synchronizes the ISN in both directions. The ISN is randomized for security (RFC 6528). The fixed cost of 1 RTT always applies.
  • State machine: 11 states. Read them with ss and fault isolation turns into observation. CLOSE-WAIT piling up = your own missing close(); TIME-WAIT is a normal safety mechanism on the active-close side.
  • Retransmission: RTO (computed from the measured RTT, with exponential backoff) and fast retransmit (3 duplicate ACKs). SACK lets the sender recover only the holes.
  • Two windows: flow control (receive window = protects the receiver) and congestion control (cwnd = protects the network) are different things. Because of AIMD, throughput collapses on loss. On Linux the algorithms you meet are CUBIC (the default) and BBR.
  • Limits: HoLB and the establishment RTT. QUIC/HTTP3 is what overcame them.

TCP's mechanism only becomes a weapon once you can "read" the numbers in ss -tin. Next, we handle which to choose — TCP or UDP — in an actual project, and the decision axes for it.


I (Yudai Tomoda) handle reliability design that digs into TCP's state transitions, retransmission, and timeout design in payment and real-time backends. "Throughput plateaus," "ECONNRESET/CLOSE-WAIT increases," "many retransmissions and unstable latency" — I pinpoint the cause of such symptoms from observation with ss/tcpdump and cure them at the root with pooling, timeout budgets, and idempotency. Feel free to reach out.

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
