友田 陽大

The Complete TCP/IP Guide: Turning the 4-Layer Model, IP, TCP, and UDP into Production Design with RFCs and Real Code

An implementation guide that explains TCP/IP in a form you can use for production design. Staying faithful to the IETF primary sources (RFC 1122, 791, 8200, 9293, 768), it aims to organize the following more clearly than the official documents do: the 4-layer model and encapsulation, IP addressing and CIDR, the difference between TCP and UDP, working Node.js/TypeScript code, behavior that matters in production such as TIME_WAIT, Keep-Alive, and MTU, and observation and debugging.

Reading time
18 min read
Author
友田 陽大

"Just leave the network to someone who knows it well." That attitude only works while production is running smoothly. ECONNRESET floods the logs. Connections clog only right after a deploy. An API behind an L7 load balancer times out at 30 seconds. A payment request comes back in a state where you cannot tell whether it succeeded or failed. None of these are application bugs; they are the behavior of TCP/IP itself. If you don't know the mechanism, the logs remain incantations, and every retry you bolt on as a stopgap makes the situation worse.

This article is an implementation guide that turns TCP/IP from something you memorize for a certification exam into knowledge you can use for production design decisions. It pins down what each layer does and does not guarantee using the IETF primary sources (RFCs), exercises those guarantees with working Node.js/TypeScript code, and then connects them to the operational knowledge that the official documents don't spell out but that reliably matters in production (TIME_WAIT, Keep-Alive, MTU, connection pools). As source material, I draw on decisions from a serverless payment platform where I designed and led the payment-reliability layer: assuming timeouts and retransmissions on mobile networks, it achieved zero double charges in production through idempotency.

Ground rules for this article: protocol specifications are taken from the IETF primary sources (RFCs). The core documents are TCP = RFC 9293 (August 2022, obsoletes RFC 793 and others), UDP = RFC 768 (1980), IPv4 = RFC 791, IPv6 = RFC 8200, and host requirements and layering = RFC 1122. RFCs get revised and obsoleted, so always confirm the latest version at rfc-editor.org before doing production design. The code runs on the Node.js standard libraries alone (net, dgram); ports, hosts, and timeout values are meant to come from environment variables, and in production you should always tune them based on observed values.


0. First, the ground rules: "TCP/IP" isn't a single standard

Before getting into design, let's pin down what the term actually means in three points.

  • TCP/IP is the collective name for a protocol suite. The name borrows from its two representative protocols (TCP and IP), but in substance it is a collection of many protocols: IP, TCP, UDP, ICMP, ARP, DNS, and so on.
  • The IETF defines it in terms of layers. RFC 1122 (Host Requirements) describes internet communication as four layers: Application / Transport / Internet / Link. The 7-layer OSI reference model taught in school is a tool for organizing concepts; the 4-layer model is the one implementations actually follow.
  • Each layer is built on the assumption that it cannot trust the layer below. IP doesn't guarantee delivery, so TCP builds reliability on top of it. Understanding this boundary of guarantees is the single most important point in learning TCP/IP.

This article works from the bottom up, from the Internet layer (IP) to the Transport layer (TCP/UDP) to implementation and operation in the application, along the axis of what each layer does and does not guarantee.


1. The 4-layer model and encapsulation: how data is wrapped and sent

1.1 The 4-layer model and the correspondence with OSI

RFC 1122 layer | Role (what it guarantees)                     | Representative protocols | Address unit | Data name
---------------|-----------------------------------------------|--------------------------|--------------|-------------------
Application    | Meaning between apps (HTTP, gRPC, etc.)       | HTTP, DNS, TLS, SMTP     | -            | message
Transport      | Inter-process multiplexing, (TCP) reliability | TCP, UDP                 | port number  | segment / datagram
Internet       | Host-to-host end-to-end delivery              | IP, ICMP                 | IP address   | packet
Link           | Node-to-node transfer within the same link    | Ethernet, Wi-Fi, ARP     | MAC address  | frame

The correspondence with the 7 OSI layers is roughly as follows. Concerns that OSI assigns to the session and presentation layers (TLS, character encoding, compression) are all handled by the Application layer in TCP/IP.

OSI 7 layers       TCP/IP 4 layers (RFC 1122)
Application  ┐
Presentation ├──►  Application   (HTTP, TLS, DNS, gRPC)
Session      ┘
Transport    ───►  Transport     (TCP, UDP)
Network      ───►  Internet      (IP, ICMP)
Data Link    ┐
Physical     ┴──►  Link          (Ethernet, Wi-Fi)

1.2 Encapsulation: "nested envelopes"

On the sending side, data from the upper layer is wrapped, in order, in the headers of each lower layer. This is encapsulation. By the time an HTTP request such as GET / physically goes out on the wire, it is wrapped like this.

[ Ethernet header [ IP header [ TCP header [ HTTP data ] ] ] FCS ]
   └ Link layer    └ Internet layer └ Transport layer └ Application layer

The receiving side unwraps in the reverse order (decapsulation), stripping each layer's header and passing the rest up. The important point is that each layer looks only at its own header. The IP layer knows nothing about TCP's contents, and the TCP layer knows nothing about the meaning of HTTP. This separation of concerns is the core of the design that has let TCP/IP scale for half a century (a good example of the single-responsibility principle at work in protocol design).
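The nesting can be made concrete by adding up the bytes each layer contributes. The following is a minimal sketch, assuming minimum header sizes (no IP options, no TCP options, no VLAN tag) and ignoring Ethernet minimum-frame padding; the function and type names are illustrative and not from any library.

```typescript
// Minimum header sizes in bytes (assumptions: no IP/TCP options, no VLAN tag).
const ETHERNET_HEADER = 14; // destination MAC (6) + source MAC (6) + EtherType (2)
const ETHERNET_FCS = 4; // frame check sequence trailer
const IPV4_HEADER = 20;
const TCP_HEADER = 20;

interface FrameLayout {
  payload: number; // application data (e.g. the HTTP request)
  tcpSegment: number; // TCP header + payload
  ipPacket: number; // IP header + TCP segment
  ethernetFrame: number; // Ethernet header + IP packet + FCS
}

// Each layer only adds its own header around whatever the layer above handed down.
function encapsulate(payloadBytes: number): FrameLayout {
  const tcpSegment = TCP_HEADER + payloadBytes;
  const ipPacket = IPV4_HEADER + tcpSegment;
  const ethernetFrame = ETHERNET_HEADER + ipPacket + ETHERNET_FCS;
  return { payload: payloadBytes, tcpSegment, ipPacket, ethernetFrame };
}

const request = Buffer.from("GET / HTTP/1.1\r\nHost: example.com\r\n\r\n");
console.log(encapsulate(request.length));
```

For a 100-byte payload this gives a 120-byte segment, a 140-byte packet, and a 158-byte frame, so 58 bytes of every such frame are envelope rather than letter.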

Implication for design: because of this nesting, each layer can evolve independently. The move from HTTP/1.1 to HTTP/2 happened without changing TCP, whereas HTTP/3 replaced the transport underneath, moving from TCP to QUIC (which runs over UDP). Being aware of which layer you are replacing, and what changes as a result, sharpens your technology choices.


2. The Internet layer: IP is "best-effort" — it doesn't guarantee delivery

2.1 What IP guarantees and doesn't guarantee

The role of IP as defined by RFC 791 (IPv4) and RFC 8200 (IPv6) is to make a best effort to deliver a packet to the destination IP address, and nothing more. Concretely, it does not guarantee the following.

  • No delivery guarantee: it can be discarded mid-route (congestion, TTL expiry, routing failure).
  • No order guarantee: the order can be swapped going through a different route.
  • No deduplication: the same packet can arrive multiple times.
  • The guarantee of integrity is limited: the IPv4 header has a checksum, but it covers the header only. The integrity of the payload is left to the Transport layer (the TCP/UDP checksum). IPv6 abolished the header checksum itself.

In other words, IP is like ordinary mail: you can post a letter, but delivery isn't guaranteed. This deliberate limitation is exactly what hands TCP, the layer above, the design responsibility of deciding where reliability gets built.

2.2 IP addresses and CIDR — narrowed to the knowledge you definitely use in production

IPv4 is 32 bits (e.g., 192.0.2.10), IPv6 is 128 bits (e.g., 2001:db8::1). What's essential in production infrastructure design is the following three points.

(1) CIDR notation (RFC 4632): the /16 of 10.0.0.0/16 means "the upper 16 bits are the network portion." VPC subnet design can't start without being able to read this.

10.0.0.0/16   → 10.0.0.0 - 10.0.255.255   (65,536 addresses, 65,534 hosts)
10.0.1.0/24   → 10.0.1.0 - 10.0.1.255     (256 addresses, 254 hosts)
              * In each subnet, two — the network address and the broadcast — are reserved
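The ranges above can be derived mechanically from the prefix length. Here is a minimal sketch; the names cidrRange and CidrRange are hypothetical, and usableHosts follows the traditional "minus network and broadcast" rule shown above (cloud providers may reserve more addresses per subnet).

```typescript
interface CidrRange {
  first: string; // network address
  last: string; // broadcast address
  addresses: number; // total addresses in the block
  usableHosts: number; // addresses minus network and broadcast
}

function ipv4ToInt(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  // Multiply instead of bit-shifting so the value stays an unsigned 32-bit number.
  return parts.reduce((acc, p) => acc * 256 + Number(p), 0);
}

function intToIpv4(n: number): string {
  return [24, 16, 8, 0].map((shift) => Math.floor(n / 2 ** shift) % 256).join(".");
}

function cidrRange(cidr: string): CidrRange {
  const [base = "", prefixText = ""] = cidr.split("/");
  if (!/^\d{1,2}$/.test(prefixText) || Number(prefixText) > 32) {
    throw new Error(`invalid prefix length in: ${cidr}`);
  }
  const size = 2 ** (32 - Number(prefixText)); // number of addresses in the block
  const network = Math.floor(ipv4ToInt(base) / size) * size; // clear the host bits
  return {
    first: intToIpv4(network),
    last: intToIpv4(network + size - 1),
    addresses: size,
    usableHosts: Math.max(size - 2, 0),
  };
}

console.log(cidrRange("10.0.0.0/16"));
console.log(cidrRange("10.0.1.0/24"));
```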

(2) Private addresses (RFC 1918): reserved ranges for closed networks that don't go out to the internet. VPCs and intranets are carved from here.

10.0.0.0/8        (10.x.x.x)            largest scale
172.16.0.0/12     (172.16-172.31.x.x)  medium scale, Docker's default
192.168.0.0/16    (192.168.x.x)         home, small scale
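A quick membership check for these three blocks is often handy when reading logs or validating configuration. This is a small sketch with a hypothetical function name; it covers only the RFC 1918 ranges listed above, not loopback, link-local, or other special-purpose ranges.

```typescript
// Returns true only for addresses inside the three RFC 1918 private blocks.
function isRfc1918(ip: string): boolean {
  const octets = ip.split(".").map(Number);
  if (octets.length !== 4 || octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255)) {
    return false; // not a well-formed dotted-quad address
  }
  const [a = -1, b = -1] = octets;
  if (a === 10) return true; // 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true; // 172.16.0.0/12
  if (a === 192 && b === 168) return true; // 192.168.0.0/16
  return false;
}

for (const ip of ["10.0.1.5", "172.31.255.255", "172.32.0.1", "192.168.1.1", "192.0.2.10"]) {
  console.log(ip, isRfc1918(ip));
}
```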

(3) MTU and MSS: the largest packet a link can carry in one frame (the MTU) is usually 1500 bytes on Ethernet. Subtracting the IP and TCP headers (20 bytes each without options) gives the MSS (Maximum Segment Size), the upper limit of data TCP can send in one segment (typically 1460 bytes). When the MTU shrinks across a VPN, a tunnel, or an inter-cloud link, large packets are dropped silently, producing the nightmare failure where only traffic above a certain size gets stuck. On paths where Path MTU Discovery (RFC 1191 for IPv4, RFC 8201 for IPv6) doesn't work, for example because the required ICMP messages are filtered, this is the first place to look.
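The MSS arithmetic is simple enough to keep as a helper when you reason about a path. A minimal sketch follows, assuming headers without options; the 1400-byte MTU in the example is a made-up value for a reduced-MTU path, not a figure for any particular VPN or tunnel.

```typescript
const IPV4_HEADER_BYTES = 20; // without options
const IPV6_HEADER_BYTES = 40; // fixed header, without extension headers
const TCP_HEADER_BYTES = 20; // without options

// MSS = MTU minus the IP header minus the TCP header.
function mssForMtu(mtu: number, ipVersion: 4 | 6 = 4): number {
  const ipHeader = ipVersion === 4 ? IPV4_HEADER_BYTES : IPV6_HEADER_BYTES;
  return mtu - ipHeader - TCP_HEADER_BYTES;
}

// How many full-size segments a payload needs at a given MSS.
function segmentsNeeded(payloadBytes: number, mss: number): number {
  return Math.ceil(payloadBytes / mss);
}

console.log(mssForMtu(1500)); // Ethernet, IPv4
console.log(mssForMtu(1500, 6)); // Ethernet, IPv6
console.log(mssForMtu(1400)); // hypothetical reduced-MTU path
console.log(segmentsNeeded(10_000, mssForMtu(1500)));
```

A segment sized for a 1500-byte MTU (1460 bytes of data) does not fit through the hypothetical 1400-byte path, which is why small requests succeed while larger responses stall.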

2.3 What to read in the IPv4 header

You don't need to memorize every field. The only ones you actually look at during failure analysis are the following.

  • TTL (Time To Live): decreases by 1 each time it crosses a router, discarded at 0. It's loop prevention and also the principle of traceroute.
  • Protocol: the kind of upper protocol. 6 = TCP, 17 = UDP, 1 = ICMP. Looking at this value with tcpdump gives a guess of the contents.
  • Source / Destination Address: the source/destination IP. Note that the source is rewritten across NAT.
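To see where these fields sit, here is a minimal sketch that pulls them out of a raw IPv4 header using the fixed offsets from RFC 791. The function name is hypothetical, the sample bytes are hand-built from documentation addresses, and the header checksum is left at zero and not verified.

```typescript
interface Ipv4Summary {
  headerBytes: number;
  ttl: number;
  protocol: string;
  source: string;
  destination: string;
}

const PROTOCOL_NAMES: Record<number, string> = { 1: "ICMP", 6: "TCP", 17: "UDP" };

function parseIpv4Header(packet: Buffer): Ipv4Summary {
  if (packet.length < 20) throw new Error("an IPv4 header needs at least 20 bytes");
  const first = packet.readUInt8(0);
  if (first >> 4 !== 4) throw new Error(`not IPv4 (version=${first >> 4})`);
  const protocolNumber = packet.readUInt8(9);
  // Addresses are 4 raw bytes each; print them as dotted decimal.
  const addr = (offset: number): string =>
    Array.from(packet.subarray(offset, offset + 4)).join(".");
  return {
    headerBytes: (first & 0x0f) * 4, // IHL is counted in 32-bit words
    ttl: packet.readUInt8(8),
    protocol: PROTOCOL_NAMES[protocolNumber] ?? `unknown(${protocolNumber})`,
    source: addr(12),
    destination: addr(16),
  };
}

// Hand-built sample: TTL 64, protocol 6 (TCP), 192.0.2.10 -> 198.51.100.1.
const sampleHeader = Buffer.from("4500003c0000400040060000c000020ac6336401", "hex");
console.log(parseIpv4Header(sampleHeader));
```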

3. The Transport layer: TCP and UDP — where to create "reliability"

Once the Internet layer has settled on best-effort, anything that needs reliability has to get it from somewhere else. TCP takes on that responsibility in full; UDP deliberately declines it and stays minimal.

3.1 Port numbers — multiplexing multiple processes on one host

The IP address points to "which host," and the port number (16 bits, 0-65535) points to "which process on that host." A single communication is uniquely identified by the 5-tuple (source IP, source port, destination IP, destination port, protocol).

  • Well-known ports (0-1023): HTTP=80, HTTPS=443, SSH=22, DNS=53, etc.
  • Ephemeral ports (49152-65535 is the IANA-recommended range; Linux defaults to 32768-60999): temporary ports that a client allocates dynamically for each connection. Exhausting them is the essence of the TIME_WAIT problem described later.

3.2 The four guarantees TCP (RFC 9293) provides

According to RFC 9293, TCP provides a connection-oriented reliable stream. Concretely, it provides the following four guarantees.

  1. Reliability: with sequence numbers, ACKs, and retransmission, it recovers lost data.
  2. Ordering: the receiving side reorders by sequence number and passes it to the app in order.
  3. Flow control: with the receiving side's Window field (the number of bytes it can receive), it controls so that a fast sender doesn't overflow a slow receiver.
  4. Congestion control: it detects network congestion and autonomously throttles the sending rate (the slow start and congestion avoidance of RFC 5681). RFC 9293 makes the implementation of congestion control mandatory.

The costs of these guarantees are the round trip needed to establish a connection (the 3-way handshake), the state each endpoint has to hold, and head-of-line blocking. The details of the mechanism (handshake, the 11-state state machine, retransmission, congestion control) are covered in a dedicated article in this cluster, "TCP mechanism complete explanation."

3.3 UDP (RFC 768): the deliberate choice to do nothing

The UDP header in RFC 768 is a mere 8 bytes (source port, destination port, length, checksum). The spec states plainly that "delivery and duplicate protection are not guaranteed." There is no connection establishment; a datagram is simply sent without any prior exchange.

There are domains where this minimalism becomes a strength: real-time audio/video, DNS, games, and QUIC (the foundation of HTTP/3). The criteria for choosing between TCP and UDP, including QUIC and HTTP/3, are covered in this cluster's "The difference between TCP and UDP and how to use them."
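The whole header fits in four 16-bit reads. The sketch below parses it with a hypothetical parseUdpHeader function; the sample datagram is hand-built (source port 51324, destination port 53, four payload bytes) and its checksum is left at zero and not validated.

```typescript
interface UdpHeader {
  sourcePort: number;
  destinationPort: number;
  length: number; // header (8 bytes) + data, as defined in RFC 768
  checksum: number;
}

function parseUdpHeader(datagram: Buffer): UdpHeader {
  if (datagram.length < 8) throw new Error("a UDP header is exactly 8 bytes");
  return {
    sourcePort: datagram.readUInt16BE(0),
    destinationPort: datagram.readUInt16BE(2),
    length: datagram.readUInt16BE(4),
    checksum: datagram.readUInt16BE(6),
  };
}

// 8 header bytes followed by a 4-byte payload.
const sampleDatagram = Buffer.from("c87c0035000c0000deadbeef", "hex");
console.log(parseUdpHeader(sampleDatagram));
```

There are no sequence numbers, acknowledgements, or window fields to parse, which is the point: everything TCP uses to build reliability is simply absent.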


4. Real code: understand TCP / UDP by "touching" them in Node.js

Theory sticks once you get your hands dirty. Using only the Node.js standard libraries, you can feel the difference between TCP and UDP. TypeScript keeps the types firm.

4.1 TCP echo server (the net module)

// Server
import net from "node:net";

const PORT = Number(process.env.TCP_PORT ?? 9000);

const server = net.createServer((socket: net.Socket) => {
  // The remote address and port, together with the local ones, form the 5-tuple that identifies this connection
  const peer = `${socket.remoteAddress}:${socket.remotePort}`;
  console.log(`[open] ${peer}`);

  // TCP is a byte stream. Note that data does not arrive in "message" units (see 4.2)
  socket.on("data", (chunk: Buffer) => {
    console.log(`[recv] ${peer} ${chunk.length} bytes`);
    socket.write(chunk); // send it straight back (echo)
  });

  socket.on("end", () => console.log(`[end]  ${peer}`)); // the peer sent a FIN
  socket.on("error", (err) => console.error(`[err]  ${peer} ${err.message}`)); // RST, etc.
});

server.listen(PORT, () => console.log(`TCP echo listening on :${PORT}`));

// Client (run as a separate file)
import net from "node:net";

const socket = net.createConnection(
  { host: "127.0.0.1", port: Number(process.env.TCP_PORT ?? 9000) },
  () => socket.write("hello tcp"), // sent after the connection is established (3-way handshake complete)
);

socket.on("data", (data: Buffer) => {
  console.log("echo:", data.toString());
  socket.end(); // send a FIN and close gracefully
});
// Without an error handler, destroy(err) below would surface as an uncaught exception
socket.on("error", (err) => console.error("client error:", err.message));
socket.setTimeout(5_000, () => socket.destroy(new Error("idle timeout")));

4.2 The fatal pitfall: "TCP doesn't preserve message boundaries"

With the server above, even if the client calls socket.write("AB") and socket.write("CD") back to back, the server's data event may fire once with "ABCD", or twice with "A" and then "BCD". TCP is a byte stream, and nothing guarantees that one write corresponds to one data event. Nearly every beginner hits this bug.

So you have to implement framing (message delimiting) yourself at the application layer. The most common approach is a length prefix.

// Reassembles frames of the form: 4-byte big-endian length + body
class LengthPrefixedDecoder {
  private buf = Buffer.alloc(0);

  /** Push a chunk and return only the messages that are now complete (a total function: returns [] if none is complete yet) */
  push(chunk: Buffer): Buffer[] {
    this.buf = Buffer.concat([this.buf, chunk]);
    const out: Buffer[] = [];
    while (this.buf.length >= 4) {
      const len = this.buf.readUInt32BE(0);
      if (this.buf.length < 4 + len) break; // the body has not fully arrived yet
      out.push(this.buf.subarray(4, 4 + len));
      this.buf = this.buf.subarray(4 + len); // discard what was consumed
    }
    return out;
  }
}

HTTP and gRPC look message-oriented because this kind of framing is implemented in the upper layer (Content-Length or chunked encoding for HTTP, a 5-byte prefix for gRPC). When you use raw TCP, keep firmly in mind that you have to provide this layer yourself.
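The sending side needs the matching encoder. This is a minimal sketch of one; encodeFrame and MAX_FRAME_BYTES are names introduced here for illustration, and the 1 MiB limit is an arbitrary assumption, not a value from the article.

```typescript
// Hypothetical upper bound so a corrupt or hostile length cannot demand unbounded memory.
const MAX_FRAME_BYTES = 1024 * 1024;

// Produces one frame: a 4-byte big-endian length followed by the body.
function encodeFrame(body: Buffer): Buffer {
  if (body.length > MAX_FRAME_BYTES) {
    throw new Error(`frame too large: ${body.length} bytes`);
  }
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// Two writes that TCP is free to merge or split arbitrarily on the wire.
const wire = Buffer.concat([encodeFrame(Buffer.from("AB")), encodeFrame(Buffer.from("CD"))]);
console.log(wire.toString("hex"));
```

However the bytes are split in transit, a length-prefix decoder on the receiving side can recover "AB" and "CD" as separate messages. Applying the same size limit on the decoding side, before waiting for the body, is a reasonable hardening step.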

4.3 UDP (the dgram module) — boundaries are preserved, but delivery isn't guaranteed

import dgram from "node:dgram";

const PORT = Number(process.env.UDP_PORT ?? 9001);
const server = dgram.createSocket("udp4");

// UDP is "1 send = 1 message": boundaries are preserved, but neither ordering nor delivery is guaranteed
server.on("message", (msg: Buffer, rinfo) => {
  console.log(`[recv] ${rinfo.address}:${rinfo.port} "${msg}"`);
  server.send(msg, rinfo.port, rinfo.address); // echo (if it never arrives, nobody notices)
});
server.bind(PORT, () => console.log(`UDP listening on :${PORT}`));

The essential difference between TCP and UDP shows up in the code. TCP sets up a connection and streams bytes, but doesn't preserve boundaries. UDP sends message by message without a connection, but has no delivery guarantee. This asymmetry is the root of every design decision that follows.
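Because UDP has no delivery guarantee, a client that needs an answer has to bring its own timeout and resend logic. The following is a rough sketch under stated assumptions: udpRequest and attemptTimeoutsMs are hypothetical helpers, the attempt count and base wait are arbitrary, and the first datagram that arrives on the socket is treated as the reply without checking who sent it.

```typescript
import { createSocket } from "node:dgram";

/** Wait per attempt: baseMs, 2*baseMs, 4*baseMs, ... (exponential backoff). */
function attemptTimeoutsMs(attempts: number, baseMs: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

/** Sends one datagram and waits for one reply, resending when nothing comes back in time. */
function udpRequest(
  host: string,
  port: number,
  payload: Buffer,
  attempts = 3,
  baseMs = 200,
): Promise<Buffer> {
  const waits = attemptTimeoutsMs(attempts, baseMs);
  return new Promise<Buffer>((resolve, reject) => {
    const socket = createSocket("udp4");
    let timer: ReturnType<typeof setTimeout> | undefined;
    let settled = false;

    // Resolve or reject exactly once, and release the socket and timer.
    const settle = (action: () => void): void => {
      if (settled) return;
      settled = true;
      if (timer !== undefined) clearTimeout(timer);
      socket.close();
      action();
    };

    const attempt = (index: number): void => {
      const wait = waits[index];
      if (wait === undefined) {
        settle(() => reject(new Error(`no reply after ${attempts} attempts`)));
        return;
      }
      socket.send(payload, port, host, (err) => {
        if (err) settle(() => reject(err));
      });
      timer = setTimeout(() => attempt(index + 1), wait); // no reply yet: resend
    };

    socket.once("message", (reply: Buffer) => settle(() => resolve(reply)));
    socket.once("error", (err: Error) => settle(() => reject(err)));
    attempt(0);
  });
}

console.log(attemptTimeoutsMs(3, 200));
```

Note the side effect: a resend means the server may see the same request more than once, so the handler on the other end has to tolerate duplicates.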


5. TCP behavior that matters in production — the "operational knowledge" beyond the RFCs

From here on is territory that doesn't appear on certification exams but reliably matters in production. It centers on points I actually ran into on a payment platform and addressed through design.

5.1 TIME_WAIT — the true nature of "I closed the connection but can't connect"

The side that actively closes a TCP connection (often the client, or an L7 proxy) stays in the TIME_WAIT state for 2×MSL (Maximum Segment Lifetime) after closing. RFC 9293 takes MSL to be 2 minutes; Linux uses a fixed TIME_WAIT of 60 seconds. This is a safety mechanism that keeps a delayed packet from an old connection from polluting a new connection with the same 5-tuple, and it is correct behavior specified by RFC 9293.

The problem arises when you open short-lived connections at high frequency (for example, a proxy that opens a new connection to the upstream API for every request). As TIME_WAIT sockets pile up, ephemeral ports run out, and new connections fail with EADDRNOTAVAIL. The countermeasures, in priority order:

  1. Reuse connections (most important): Keep-Alive and connection pools. If you never close the connection in the first place, no TIME_WAIT is created.
  2. Tune kernel parameters (net.ipv4.tcp_tw_reuse, etc.) only as a last resort for symptom relief. Reducing the number of connections through application design comes first.
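A back-of-the-envelope calculation shows how quickly short-lived connections hit the ceiling. This sketch assumes every connection goes to the same destination IP and port from one source IP, that each closed connection holds its port for the full TIME_WAIT period, and a 60-second TIME_WAIT; the function names are illustrative.

```typescript
// Number of ports in an inclusive range.
function portsInRange(first: number, last: number): number {
  return last - first + 1;
}

// If every port is held for timeWaitSeconds after use, this is the steady-state
// rate of new connections to a single destination before ports run out.
function sustainableConnectionsPerSecond(ports: number, timeWaitSeconds: number): number {
  return ports / timeWaitSeconds;
}

const ianaPorts = portsInRange(49152, 65535);
const rate = sustainableConnectionsPerSecond(ianaPorts, 60);
console.log(`${ianaPorts} ports, about ${Math.floor(rate)} new connections per second`);
```

With the IANA range that is 16,384 ports and roughly 273 new connections per second to one upstream, which a busy proxy without connection reuse can exceed easily. Check the actual port range on your hosts before relying on any such number.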

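5.2 Keep-Alive and connection pools: not closing is the right default

import net from "node:net";

// Host and port come from environment variables (the variable names and defaults here are placeholders)
const host = process.env.UPSTREAM_HOST ?? "127.0.0.1";
const port = Number(process.env.UPSTREAM_PORT ?? 9000);

const socket = net.createConnection({ host, port });

// TCP Keep-Alive: periodically checks whether an idle connection is still alive (liveness probing)
socket.setKeepAlive(true, 30_000); // start keepalive probes after 30 seconds idle

// Disable the Nagle algorithm: send small packets immediately (see 5.3)
socket.setNoDelay(true);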

For an HTTP client, always use a connection pool. Node.js's undici (the implementation behind fetch) and AWS SDK v3 reuse connections by default. The iron rule is to size the pool deliberately and to set the client's idle timeout shorter than the upstream's idle timeout: if the upstream closes first, you pick up a dead connection and hit ECONNRESET.
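As one way to apply that rule with the standard library, the sketch below configures a node:http Agent so the client-side timeout sits below the upstream's idle timeout. The environment variable name, the 60-second upstream value, the 5-second margin, and the pool sizes are all assumptions for illustration. The Agent's timeout option sets the socket timeout; how it applies to idle pooled sockets is worth verifying on your Node.js version.

```typescript
import { Agent } from "node:http";

// Assumed upstream idle timeout (e.g. taken from the load balancer's documentation).
const UPSTREAM_IDLE_TIMEOUT_MS = Number(process.env.UPSTREAM_IDLE_TIMEOUT_MS ?? 60_000);
const SAFETY_MARGIN_MS = 5_000;

// The client must give up on an idle connection before the upstream does.
function clientIdleTimeoutMs(upstreamIdleMs: number, marginMs: number): number {
  if (upstreamIdleMs <= marginMs) {
    throw new Error("upstream idle timeout is too short for the chosen margin");
  }
  return upstreamIdleMs - marginMs;
}

const agent = new Agent({
  keepAlive: true, // reuse connections instead of closing after each request
  maxSockets: 50, // upper bound on concurrent connections per origin
  maxFreeSockets: 10, // idle connections kept around for reuse
  timeout: clientIdleTimeoutMs(UPSTREAM_IDLE_TIMEOUT_MS, SAFETY_MARGIN_MS),
});

console.log(`pool ready, maxSockets=${agent.maxSockets}`);
```

The agent would then be passed as the agent option of http.request; the same ordering rule applies to whatever pool your HTTP client library provides.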

5.3 The Nagle algorithm and delayed ACK: a notoriously bad combination

  • The Nagle algorithm: accumulates small pieces of data and sends them together (saving bandwidth).
  • Delayed ACK: holds an ACK back briefly so it can be combined with other ACKs or with outgoing data (also to save bandwidth).

When the two interact, the sender waits for an ACK while the receiver waits for more data, and a deadlock-like delay of up to a few hundred ms occurs. If you want interactive, small RPCs at low latency, disabling Nagle with setNoDelay(true) (TCP_NODELAY) is standard practice. For throughput-oriented bulk transfer, on the other hand, leaving it on can sometimes be better. The rule is not "always disable" but "decide by workload."

5.4 backlog and SYN — "clogs only right after a deploy"

The backlog in listen(backlog) is the length of the queue that holds established connections the application has not yet accepted. If it is too small, connections are silently dropped or delayed right after startup or during a spike. Node.js's default (511) is rarely insufficient (on Linux the effective value is also capped by net.core.somaxconn), but you need to design it together with the backlog and connection limits of the reverse proxy or load balancer in front.

5.5 Distinguish "three kinds" of timeout

If you lump every "timeout" together, the design goes wrong. Separate at least these three.

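Kind                      | What it waits for                 | If too short                       | If too long
--------------------------|-----------------------------------|------------------------------------|-----------------------------------------------
Connection timeout        | Completion of the 3-way handshake | Cuts off even healthy connections  | Holds on to a dead destination for a long time
Idle/receive timeout      | Time during which no data arrives | Cuts off normal but slow responses | Can't detect a hang
Overall (request) timeout | Total time for one request        | Induces a retry storm              | Ties up resources for a long time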

On the payment platform, I designed these as a staircase running from the outermost caller to the innermost dependency: the further out, the longer; the further in, the shorter (a timeout budget). If an inner timeout is longer than the outer one, the inner layer keeps running after the outer layer has given up, which is the worst case: doing work nobody is waiting for while consuming resources.
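The staircase rule is easy to check mechanically, for example in a configuration test. A minimal sketch follows; the hop names and millisecond values are invented for illustration and are not the platform's actual settings.

```typescript
interface Hop {
  name: string;
  timeoutMs: number;
}

/** Hops are listed from the outermost caller to the innermost dependency. */
function findBudgetViolations(hops: Hop[]): string[] {
  const violations: string[] = [];
  for (let i = 1; i < hops.length; i++) {
    const outer = hops[i - 1];
    const inner = hops[i];
    if (outer === undefined || inner === undefined) continue;
    // An inner timeout that is not strictly shorter keeps working after the caller gave up.
    if (inner.timeoutMs >= outer.timeoutMs) {
      violations.push(
        `${inner.name} (${inner.timeoutMs} ms) must be shorter than ${outer.name} (${outer.timeoutMs} ms)`,
      );
    }
  }
  return violations;
}

const staircase: Hop[] = [
  { name: "client", timeoutMs: 30_000 },
  { name: "gateway", timeoutMs: 29_000 },
  { name: "handler", timeoutMs: 25_000 },
  { name: "database call", timeoutMs: 5_000 },
];

console.log(findBudgetViolations(staircase)); // empty when the staircase holds
```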


6. Observation and debugging — triage "it won't connect" by fact

Once you can read TCP's state transitions, failure triage changes from guesswork to observation. Here are the practical tools on Linux.

# 1) List current TCP connections and their states (ss is the faster successor to netstat)
ss -tan
#   State      Recv-Q Send-Q   Local Address:Port   Peer Address:Port
#   ESTAB      0      0        10.0.1.5:443         10.0.2.9:51324
#   TIME-WAIT  0      0        10.0.1.5:51200       10.0.3.4:5432   ← if there are many, suspect 5.1
#   SYN-SENT   0      1        10.0.1.5:40012       10.0.9.9:80     ← the peer has not returned a SYN-ACK

# 2) Count connections per state (useful for detecting TIME-WAIT buildup)
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn

# 3) Look at packets directly (is the handshake completing?)
sudo tcpdump -ni eth0 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'

# 4) Find where along the path traffic gets stuck (raise the TTL by 1 at a time and see who responds)
traceroute -T -p 443 api.example.com   # -T uses TCP SYN

A quick reference for deriving the cause from the state:

  • SYN-SENT lingers → the SYN left for the destination but no SYN-ACK comes back. Suspect a firewall, security group, routing, or a stopped destination process.
  • Many SYN-RECV → a SYN flood or backlog exhaustion. Check SYN cookies.
  • Many TIME-WAIT → too many short-lived connections are being created (§5.1). Switch to Keep-Alive.
  • CLOSE-WAIT lingers → the application isn't calling close() (a bug in your own code). The classic resource leak.
  • ESTAB with no response and Recv-Q growing → the application isn't reading the data (processing is stuck).

Lingering CLOSE-WAIT is the one state that is almost certainly a bug in your own code (a FIN was received but the socket was never closed). The other states may be caused by the peer or the network, but for CLOSE-WAIT, suspect the application first.
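The per-state tally and the quick reference can be combined into a small helper, for instance to attach a hint to an alert. This is a sketch only: it assumes the column layout of the ss -tan output shown above, the sample text is illustrative rather than captured from a real host, and the threshold is arbitrary.

```typescript
// Hints taken from the quick reference above.
const STATE_HINTS: Record<string, string> = {
  "SYN-SENT": "no SYN-ACK coming back: check firewall, security group, routing, destination process",
  "SYN-RECV": "possible SYN flood or backlog exhaustion: check SYN cookies",
  "TIME-WAIT": "too many short-lived connections: reuse connections with Keep-Alive",
  "CLOSE-WAIT": "the application is not closing sockets: look for a leak in your own code",
};

/** Counts connections per state; the first line is assumed to be the header row. */
function countStates(ssOutput: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of ssOutput.split("\n").slice(1)) {
    const state = line.trim().split(/\s+/)[0];
    if (!state) continue; // skip blank lines
    counts.set(state, (counts.get(state) ?? 0) + 1);
  }
  return counts;
}

/** Returns a hint for every known state whose count reaches the threshold. */
function hintsFor(counts: Map<string, number>, threshold: number): string[] {
  const hints: string[] = [];
  counts.forEach((count, state) => {
    const hint = STATE_HINTS[state];
    if (hint !== undefined && count >= threshold) hints.push(`${state} x${count}: ${hint}`);
  });
  return hints;
}

const sampleSsOutput = [
  "State      Recv-Q Send-Q   Local Address:Port   Peer Address:Port",
  "ESTAB      0      0        10.0.1.5:443         10.0.2.9:51324",
  "TIME-WAIT  0      0        10.0.1.5:51200       10.0.3.4:5432",
  "TIME-WAIT  0      0        10.0.1.5:51201       10.0.3.4:5432",
  "SYN-SENT   0      1        10.0.1.5:40012       10.0.9.9:80",
].join("\n");

console.log(hintsFor(countStates(sampleSsOutput), 2));
```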


7. Reliability design: don't fully rely on TCP's "reliability"

Finally, the most practical lesson. TCP builds reliability out of retransmission. Seen from the network, that makes delivery at-least-once: the same data may cross the wire more than once, and TCP removes the duplicates only within a single connection. In the world above TCP (load balancers, API gateways, mobile networks, Lambda retries), the same request arriving multiple times is not an anomaly but an everyday event.

The core of what I designed on the payment platform was exactly this. The challenge was precisely that mobile-network timeouts and API Gateway/Lambda retries cause the same payment request to arrive multiple times. In response:

  • Accept retries as part of the normal path (acknowledge that the network will always retransmit).
  • On top of that, make the charge converge to exactly once: a conditional write (attribute_not_exists) ensures each client-issued idempotency key can be used only once, and a DynamoDB transaction makes the balance update atomic.

As a result, even when retransmission happens in every layer above TCP, the application-level effect is exactly-once, and the platform achieved zero double charges in production.
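The shape of that idea can be sketched without any database. The code below is an in-memory stand-in, not the platform's implementation and not DynamoDB code: the Set plays the role of the conditional write on the idempotency key, and the single synchronous method plays the role of the atomic transaction. All names are hypothetical.

```typescript
type ChargeResult =
  | { status: "charged"; balance: number }
  | { status: "duplicate"; balance: number };

class InMemoryLedger {
  private readonly seenKeys = new Set<string>();
  private balance: number;

  constructor(initialBalance: number) {
    this.balance = initialBalance;
  }

  /** Applies the charge at most once per idempotency key, however often it is retried. */
  charge(idempotencyKey: string, amount: number): ChargeResult {
    // Stand-in for the conditional write: proceed only if the key has never been recorded.
    if (this.seenKeys.has(idempotencyKey)) {
      return { status: "duplicate", balance: this.balance };
    }
    if (amount > this.balance) throw new Error("insufficient balance");
    // Recording the key and updating the balance happen together in one synchronous
    // step here; a real store needs a transaction to get the same atomicity.
    this.seenKeys.add(idempotencyKey);
    this.balance -= amount;
    return { status: "charged", balance: this.balance };
  }
}

const ledger = new InMemoryLedger(1_000);
console.log(ledger.charge("payment-001", 300)); // first arrival
console.log(ledger.charge("payment-001", 300)); // retry of the same request
```

The retry is accepted as a normal request and simply has no further effect, which is what lets every layer above TCP retransmit freely.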

Generalizing the lesson:

TCP's reliability only repairs the loss of bytes within a single connection. It does nothing about the duplicates that really occur when a connection drops and is re-established, or when an upper layer retries. So in domains like payments, inventory, and messaging, don't delegate reliability wholesale to the network; design it as idempotency at the application layer.

This is a good example of TCP/IP knowledge not staying low-level background knowledge but tying directly to monetary correctness in production.


8. Summary: a checklist to turn TCP/IP into "design judgment"

  • You can state the boundary of guarantees: IP is best-effort (it guarantees none of delivery, ordering, or deduplication). TCP adds reliability; UDP adds only minimal multiplexing.
  • You can explain the nesting of encapsulation (each layer looks only at its own header = separation of concerns).
  • You can read CIDR, RFC 1918, and MTU/MSS (for VPC design and for triaging MTU-related failures).
  • You know that TCP is a byte stream: write and data are not 1-to-1. With raw TCP, you implement framing yourself.
  • You design TIME_WAIT, Keep-Alive, Nagle, backlog, and the three kinds of timeout from an operational viewpoint.
  • You can read TCP states with ss and derive the cause from SYN-SENT, CLOSE-WAIT, and TIME-WAIT.
  • You don't delegate reliability wholesale to the network; you design the application to be idempotent on the assumption that duplicates occur.

TCP/IP is not a classic to be memorized but living design knowledge that determines latency, cost, and correctness in production. Next, I dig into its heart (TCP's state machine, retransmission, and congestion control) following RFC 9293.


I (Yudai Tomoda) design and implement backends that don't fall over, are traceable, and are correct, working solo with generative AI (Claude Code) and accounting for low-level behavior. "ECONNRESET won't stop," "TIME_WAIT is exhausting our ports," "I don't understand how to design timeouts and retries," "I'm afraid of double processing in payments or inventory": for network- and reliability-related problems like these, I identify the cause by reproducing and observing the phenomenon, then fix it at the root with idempotency and appropriate timeout design. Please feel free to get in touch.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
