Troubleshooting a TCP Sender: Common Issues and Fixes

TCP Sender Implementation: Key Algorithms and Best Practices

Summary: this article outlines the sender-side responsibilities in TCP, the core congestion-control and loss-recovery algorithms you’ll implement, practical engineering details, and recommended best practices for robust, high-performance TCP senders.

1. Sender responsibilities (concise)

  • Maintain connection state: sequence numbers, snd_una, snd_nxt, snd_wnd, rcv_wnd, SRTT, RTTvar, RTO.
  • Flow control vs. congestion control: respect receiver window (rwnd) and congestion window (cwnd); limit bytes in flight = min(cwnd, rwnd).
  • Retransmission logic: detect loss (dupACKs, RTO) and schedule retransmits.
  • ACK processing: advance snd_una, update RTT estimator, free buffers, perform fast retransmit/recovery.
  • Segmentization and MSS handling, TCP options (SACK, timestamps, window scale, ECN).

2. Core algorithms to implement

  • Slow Start and Congestion Avoidance (AIMD)

    • Start cwnd = 1–10 MSS (commonly 1 or 2). On each ACK in slow start, cwnd += MSS (exponential growth). On loss, ssthresh = max(cwnd/2, 2*MSS) and cwnd := 1 MSS (or half for NewReno fast-recovery behavior).
    • After cwnd ≥ ssthresh, switch to congestion avoidance (additive increase): cwnd += MSS*MSS / cwnd per ACK (approximates +1 MSS / RTT).
  • Fast Retransmit and Fast Recovery (NewReno)

    • On 3 duplicate ACKs -> fast retransmit lost segment(s). Set ssthresh = max(cwnd/2, 2*MSS).
    • Enter fast recovery: set cwnd = ssthresh + 3*MSS (or appropriate inflation), retransmit, and on each additional duplicate ACK, inflate cwnd by 1 MSS to allow new segments; on ACK acknowledging new data, set cwnd = ssthresh and switch to congestion avoidance.
    • Implement SACK-aware recovery to retransmit multiple losses in one RTT.
  • Selective Acknowledgment (SACK)

    • Parse SACK option, maintain SACK scoreboard, retransmit the highest-priority missing segments first, avoid unnecessary retransmits.
    • Use SACK for quicker recovery from multiple losses.
  • RTT and RTO calculation (Jacobson/Karels)

    • Maintain SRTT and RTTVAR; set RTO = SRTT + max(G, K*RTTVAR) (K=4 per RFC6298; initial RTO = 1s).
    • Use timestamp option to measure RTT for retransmitted packets; apply Karn’s algorithm (do not use RTT samples for retransmitted segments).
  • Path MTU Discovery & MSS clamping

    • Discover path MTU, clamp MSS accordingly to avoid fragmentation.
  • ECN (Explicit Congestion Notification)

    • Support ECE/CWR handshake: on ECE, reduce cwnd similarly to loss and send CWR flag; follow RFC recommendations.
  • Modern CCAs (implement as option)

    • CUBIC: time-based cubic cwnd growth for high BDP links (default in many OSes).
    • BBR (or Copa): model-based rate control that targets bottleneck bandwidth and min-RTT; requires a different sender architecture (bandwidth/RTT probes, pacing).
    • Make congestion control pluggable so you can switch algorithms without changing core stack.

3. Practical implementation details

  • Pacing and batching

    • Use packet pacing (pace packets across RTT) to reduce burst losses and queueing. Pacing interval ≈ RTT/cwnd or use a token-bucket paced by desired rate.
    • Batch ACK processing and transmit work in kernel path to reduce syscalls, but keep per-packet timers and retransmit accuracy.
  • Timers and timers wheel

    • Use a scalable timer mechanism (timer wheel, heap) for per-connection RTOs and delayed-ACK timers; avoid per-packet timers where possible.
    • Implement delayed ACKs (commonly 200 ms or 2 ACKs) but allow policy tuning.
  • Retransmission policy

    • Exponential backoff on RTOs (double RTO each consecutive timeout up to a maximum).
    • Use Limited Transmit: when small flight size, allow sending new segments on duplicate ACKs before retransmit to stimulate ACKs.
    • On RTO, reset cwnd = 1 MSS (or use modern modifications like RFC6298 guidance).
  • Buffer management and memory

    • Use ring buffers for send queues; avoid copying via scatter-gather I/O where possible.
    • Limit per-connection memory with high and low watermarks and allow congestion control to backpressure application.
  • SACK scoreboard and selective retransmit

    • Track outstanding segments, holes, retransmit timestamps, and retransmission counts.
    • Implement heuristics to avoid retransmitting segments presumed lost but potentially reordering victims.
  • Handling reordering

    • Be conservative with interpreting duplicate ACKs as loss if reordering is common; if known path reordering, tune dupACK threshold or use delay-based signals (RTT rise).
  • Security and robustness

    • Validate sequence numbers and window updates.
    • Implement blind reset protections (challenge ACKs) and SYN flood mitigations (SYN cookies).
    • Rate-limit retransmissions and protect against amplification abuses.

4. Testing and observability

  • Unit tests for state transitions (slow start, fast recovery, RTO).
  • Emulated network tests (netem, tc) for latency, loss, reordering, and bandwidth-delay product scenarios.
  • Large-scale fuzzing: random drops, reordered packets, duplicated segments, misordered SACK blocks.
  • Performance benchmarks: measure goodput, RTT, retransmission rate, fairness with other flows (Reno/CUBIC/BBR).
  • Metrics and logging: cwnd, ssthresh, RTT samples, RTOs, retransmits, SACK holes, bytes-in-flight, pacing rate. Export via telemetry (prometheus, logs).

5. Best practices and recommendations

  • Make congestion control modular: pluggable algorithms (Reno, NewReno, CUBIC, BBR) so you can test and switch.
  • Default to conservative, well-tested behavior (SACK + NewReno/CUBIC) with options for BBR if you need low latency and controlled probing.
  • Use SACK and Timestamps: these greatly improve recovery and RTT accuracy.
  • Implement pacing by default to reduce bursts and bufferbloat.
  • Use careful RTO and backoff rules (Karn’s algorithm, exponential backoff) to avoid spurious retransmit storms.
  • Prefer bandwidth-delay-aware algorithms (CUBIC/BBR) for high BDP links; prefer delay-based elements if low latency matters.
  • Ensure robust handling of middleboxes: many paths rewrite ECN or strip options—implement fallbacks.
  • Provide tunable knobs but sensible defaults; avoid exposing low-level timers to apps unless needed.
  • Prioritize observability: include counters and state dumps to diagnose performance issues quickly.

6. Example simplified sender loop (pseudo-steps)

  1. On new data from app: segment up to MSS, queue, attempt send up to min(cwnd, rwnd) bytes in flight.
  2. On ACK:
    • Update SRTT (if ACK acknowledges original segment and not retransmit).
    • Advance snd_una, free acknowledged data buffers.
    • Update cwnd (slow start or congestion avoidance).
    • If dupACK count >= threshold -> fast retransmit/recovery (SACK-aware).
  3. On 3dupACK + SACK holes -> retransmit missing segments per scoreboard.
  4. On RTO -> retransmit earliest unacked segment, backoff RTO, reset cwnd = 1 MSS, ssthresh = max(cwnd/2, 2*MSS).
  5. Regularly pace transmissions by token bucket or timer tick; respect pacing and avoid bursts.

7. Interoperability and RFCs to follow

  • RFC 793 (original TCP) — fundamentals.
  • RFC 1122 — host requirements.
  • RFC 2988 / RFC 6298 — RTO calculation and initial values.
  • RFC 2018 — SACK.
  • RFC 7323 — TCP extensions for high performance (timestamps, window scale).
  • RFC 3168 — ECN.
  • RFC 5681 — TCP congestion control (AIMD, slow start, fast retransmit/recovery).
  • RFC 8312 — CUBIC (for behavior details).
  • BBR papers and drafts (for implementation notes).

8. When to choose which congestion control

  • Short flows/latency-sensitive apps: prefer conservative cwnd growth + low queueing (delay-aware tuning); consider BBR v2 or Copa.
  • Bulk transfer in high-BDP networks: CUBIC or other high-speed loss-based algorithms.
  • Mixed environments with reordering: enable SACK and tune dupACK threshold or use delay signals.

Closing note: implement a clear separation between core TCP mechanics (sequencing, retransmit timers, ACK processing) and congestion-control policy, make the CCA pluggable, default to SACK+timestamps+pacing, and instrument heavily for real-world tuning.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *