TCP Sender Implementation: Key Algorithms and Best Practices
Summary: this article outlines the sender-side responsibilities in TCP, the core congestion-control and loss-recovery algorithms you’ll implement, practical engineering details, and recommended best practices for robust, high-performance TCP senders.
1. Sender responsibilities (concise)
- Maintain connection state: sequence numbers, snd_una, snd_nxt, snd_wnd, rcv_wnd, SRTT, RTTvar, RTO.
- Flow control vs. congestion control: respect receiver window (rwnd) and congestion window (cwnd); limit bytes in flight = min(cwnd, rwnd).
- Retransmission logic: detect loss (dupACKs, RTO) and schedule retransmits.
- ACK processing: advance snd_una, update RTT estimator, free buffers, perform fast retransmit/recovery.
- Segmentization and MSS handling, TCP options (SACK, timestamps, window scale, ECN).
2. Core algorithms to implement
-
Slow Start and Congestion Avoidance (AIMD)
- Start cwnd = 1–10 MSS (commonly 1 or 2). On each ACK in slow start, cwnd += MSS (exponential growth). On loss, ssthresh = max(cwnd/2, 2*MSS) and cwnd := 1 MSS (or half for NewReno fast-recovery behavior).
- After cwnd ≥ ssthresh, switch to congestion avoidance (additive increase): cwnd += MSS*MSS / cwnd per ACK (approximates +1 MSS / RTT).
-
Fast Retransmit and Fast Recovery (NewReno)
- On 3 duplicate ACKs -> fast retransmit lost segment(s). Set ssthresh = max(cwnd/2, 2*MSS).
- Enter fast recovery: set cwnd = ssthresh + 3*MSS (or appropriate inflation), retransmit, and on each additional duplicate ACK, inflate cwnd by 1 MSS to allow new segments; on ACK acknowledging new data, set cwnd = ssthresh and switch to congestion avoidance.
- Implement SACK-aware recovery to retransmit multiple losses in one RTT.
-
Selective Acknowledgment (SACK)
- Parse SACK option, maintain SACK scoreboard, retransmit the highest-priority missing segments first, avoid unnecessary retransmits.
- Use SACK for quicker recovery from multiple losses.
-
RTT and RTO calculation (Jacobson/Karels)
- Maintain SRTT and RTTVAR; set RTO = SRTT + max(G, K*RTTVAR) (K=4 per RFC6298; initial RTO = 1s).
- Use timestamp option to measure RTT for retransmitted packets; apply Karn’s algorithm (do not use RTT samples for retransmitted segments).
-
Path MTU Discovery & MSS clamping
- Discover path MTU, clamp MSS accordingly to avoid fragmentation.
-
ECN (Explicit Congestion Notification)
- Support ECE/CWR handshake: on ECE, reduce cwnd similarly to loss and send CWR flag; follow RFC recommendations.
-
Modern CCAs (implement as option)
- CUBIC: time-based cubic cwnd growth for high BDP links (default in many OSes).
- BBR (or Copa): model-based rate control that targets bottleneck bandwidth and min-RTT; requires a different sender architecture (bandwidth/RTT probes, pacing).
- Make congestion control pluggable so you can switch algorithms without changing core stack.
3. Practical implementation details
-
Pacing and batching
- Use packet pacing (pace packets across RTT) to reduce burst losses and queueing. Pacing interval ≈ RTT/cwnd or use a token-bucket paced by desired rate.
- Batch ACK processing and transmit work in kernel path to reduce syscalls, but keep per-packet timers and retransmit accuracy.
-
Timers and timers wheel
- Use a scalable timer mechanism (timer wheel, heap) for per-connection RTOs and delayed-ACK timers; avoid per-packet timers where possible.
- Implement delayed ACKs (commonly 200 ms or 2 ACKs) but allow policy tuning.
-
Retransmission policy
- Exponential backoff on RTOs (double RTO each consecutive timeout up to a maximum).
- Use Limited Transmit: when small flight size, allow sending new segments on duplicate ACKs before retransmit to stimulate ACKs.
- On RTO, reset cwnd = 1 MSS (or use modern modifications like RFC6298 guidance).
-
Buffer management and memory
- Use ring buffers for send queues; avoid copying via scatter-gather I/O where possible.
- Limit per-connection memory with high and low watermarks and allow congestion control to backpressure application.
-
SACK scoreboard and selective retransmit
- Track outstanding segments, holes, retransmit timestamps, and retransmission counts.
- Implement heuristics to avoid retransmitting segments presumed lost but potentially reordering victims.
-
Handling reordering
- Be conservative with interpreting duplicate ACKs as loss if reordering is common; if known path reordering, tune dupACK threshold or use delay-based signals (RTT rise).
-
Security and robustness
- Validate sequence numbers and window updates.
- Implement blind reset protections (challenge ACKs) and SYN flood mitigations (SYN cookies).
- Rate-limit retransmissions and protect against amplification abuses.
4. Testing and observability
- Unit tests for state transitions (slow start, fast recovery, RTO).
- Emulated network tests (netem, tc) for latency, loss, reordering, and bandwidth-delay product scenarios.
- Large-scale fuzzing: random drops, reordered packets, duplicated segments, misordered SACK blocks.
- Performance benchmarks: measure goodput, RTT, retransmission rate, fairness with other flows (Reno/CUBIC/BBR).
- Metrics and logging: cwnd, ssthresh, RTT samples, RTOs, retransmits, SACK holes, bytes-in-flight, pacing rate. Export via telemetry (prometheus, logs).
5. Best practices and recommendations
- Make congestion control modular: pluggable algorithms (Reno, NewReno, CUBIC, BBR) so you can test and switch.
- Default to conservative, well-tested behavior (SACK + NewReno/CUBIC) with options for BBR if you need low latency and controlled probing.
- Use SACK and Timestamps: these greatly improve recovery and RTT accuracy.
- Implement pacing by default to reduce bursts and bufferbloat.
- Use careful RTO and backoff rules (Karn’s algorithm, exponential backoff) to avoid spurious retransmit storms.
- Prefer bandwidth-delay-aware algorithms (CUBIC/BBR) for high BDP links; prefer delay-based elements if low latency matters.
- Ensure robust handling of middleboxes: many paths rewrite ECN or strip options—implement fallbacks.
- Provide tunable knobs but sensible defaults; avoid exposing low-level timers to apps unless needed.
- Prioritize observability: include counters and state dumps to diagnose performance issues quickly.
6. Example simplified sender loop (pseudo-steps)
- On new data from app: segment up to MSS, queue, attempt send up to min(cwnd, rwnd) bytes in flight.
- On ACK:
- Update SRTT (if ACK acknowledges original segment and not retransmit).
- Advance snd_una, free acknowledged data buffers.
- Update cwnd (slow start or congestion avoidance).
- If dupACK count >= threshold -> fast retransmit/recovery (SACK-aware).
- On 3dupACK + SACK holes -> retransmit missing segments per scoreboard.
- On RTO -> retransmit earliest unacked segment, backoff RTO, reset cwnd = 1 MSS, ssthresh = max(cwnd/2, 2*MSS).
- Regularly pace transmissions by token bucket or timer tick; respect pacing and avoid bursts.
7. Interoperability and RFCs to follow
- RFC 793 (original TCP) — fundamentals.
- RFC 1122 — host requirements.
- RFC 2988 / RFC 6298 — RTO calculation and initial values.
- RFC 2018 — SACK.
- RFC 7323 — TCP extensions for high performance (timestamps, window scale).
- RFC 3168 — ECN.
- RFC 5681 — TCP congestion control (AIMD, slow start, fast retransmit/recovery).
- RFC 8312 — CUBIC (for behavior details).
- BBR papers and drafts (for implementation notes).
8. When to choose which congestion control
- Short flows/latency-sensitive apps: prefer conservative cwnd growth + low queueing (delay-aware tuning); consider BBR v2 or Copa.
- Bulk transfer in high-BDP networks: CUBIC or other high-speed loss-based algorithms.
- Mixed environments with reordering: enable SACK and tune dupACK threshold or use delay signals.
Closing note: implement a clear separation between core TCP mechanics (sequencing, retransmit timers, ACK processing) and congestion-control policy, make the CCA pluggable, default to SACK+timestamps+pacing, and instrument heavily for real-world tuning.
Leave a Reply