Auto-Recovery Pipelines

Networks misbehave in the real world. BLE links jitter, Wi-Fi bursts collide, buffers overflow, and low-power sensors occasionally flip a bit. The worst part isn’t the packet loss; it’s the glitch your user sees: a frozen chart, a jumpy graph, or an audio dropout. This post shows how to build an auto-recovery pipeline that quietly detects corruption, drops the bad data, resynchronizes within milliseconds, and conceals the failure so the experience stays smooth. We’ll follow a practical deep dive (Problem → Approach → Process → Results → Takeaways) with tiny, copy-pasteable snippets, production-ready diagrams you can recreate, and pragmatic defaults that work on constrained MCUs and mobile apps alike.


Problem

  • Corruption happens (RF noise, partial frames, buffer overruns, misordered delivery).
  • Late/duplicate packets arrive during roaming and power transitions.
  • Naive retries create visible stalls and battery drain.
  • UI/ML pipelines magnify tiny defects into obvious artifacts (e.g., spikes in time-series, audible clicks, stuck values).
  • Background constraints (sleep states, throttled timers) make “retry until success” unreliable and expensive.

The challenge: guarantee smoothness without heavyweight handshakes or wasting energy—especially on mobile + embedded links—while keeping downstream analytics trustworthy.


Approach

Design the runtime around four fast, local decisions:

  1. Detect: Per-packet integrity (CRC/length/schema), monotonic sequence checks, and timing sanity.
  2. Decide: Classify as good, late, duplicate, corrupt, or missing.
  3. Conceal: For gaps, apply play-out smoothing (interpolation/hold-last/PLC) tuned to the modality.
  4. Re-sync: Use selective retransmission (NACK/SACK) or lightweight FEC to repair without stalling the UI.

Wrap this with a tiny state machine, a jitter buffer, and exponential backoff timers. Push just enough intelligence to the edge to hide transient problems, and keep cloud/app aggregation idempotent.


Process

1) Packet Format That Makes Recovery Cheap

A compact header makes wrong data obvious and repair cheap:

// Minimal header (6 bytes)
typedef struct __attribute__((packed)) {
  uint16_t seq;       // increments per packet
  uint8_t  flags;     // bit0:keyframe, bit1:parity, bit2:resync
  uint8_t  schema;    // data layout version
  uint16_t crc16;     // CRC over header (this field excluded or zeroed) + payload
} hdr_t;
  • seq: Detect loss/out-of-order quickly.
  • flags: Mark keyframes, parity/FEC blocks, or resync beacons.
  • schema: Reject incompatible layouts immediately.
  • crc16: Fast validation on MCU/SoC without heavy CPU.

Tip: Put the header up front, payload fixed-length where possible, and align to DMA-friendly boundaries so the CRC happens zero-copy.
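
To make the header concrete, here is a minimal Python sketch that packs and validates a frame with the same six-byte layout. Several details are assumptions, since the post does not pin them down: little-endian fields, a CRC computed over the first four header bytes plus the payload, the CRC-CCITT variant from `binascii.crc_hqx`, and the helper names `build_frame` and `parse_frame`.

```python
import binascii
import struct

HDR_NO_CRC = struct.Struct("<HBB")   # seq, flags, schema (assumed little-endian)
CRC_FIELD = struct.Struct("<H")      # crc16, last two bytes of the 6-byte header
SUPPORTED_SCHEMAS = {1}              # hypothetical set of known schema versions


def build_frame(seq, flags, schema, payload):
    """Serialize header + payload; the CRC covers everything except itself."""
    head = HDR_NO_CRC.pack(seq & 0xFFFF, flags, schema)
    crc = binascii.crc_hqx(head + payload, 0xFFFF)
    return head + CRC_FIELD.pack(crc) + payload


def parse_frame(frame, payload_len):
    """Return (seq, flags, schema, payload), or None if the frame must be dropped."""
    if len(frame) != 6 + payload_len:          # length check first: cheapest
        return None
    seq, flags, schema = HDR_NO_CRC.unpack_from(frame, 0)
    (crc,) = CRC_FIELD.unpack_from(frame, 4)
    payload = frame[6:]
    if binascii.crc_hqx(frame[:4] + payload, 0xFFFF) != crc:
        return None                            # corrupt: drop without side effects
    if schema not in SUPPORTED_SCHEMAS:
        return None                            # unknown layout: do not parse further
    return seq, flags, schema, payload
```

The checks run from cheapest to most expensive, and none of them touches shared state, so a rejected frame leaves no trace beyond a counter you choose to bump.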


2) Fast Integrity & Ordering

Do O(1) checks per packet before touching shared buffers:

bool good = crc16_ok(pkt) && len_ok(pkt) && schema_ok(pkt.schema);
if (!good) { stats.bad_crc++; classify_and_drop(pkt); return; }

int16_t d = (int16_t)(pkt.seq - expect_seq);  // handles wrap
if (d == 0)          accept_in_order(pkt);
else if (d > 0)      enqueue_gap_and_pkt(d, pkt);   // d packets missing: expect_seq .. pkt.seq-1
else                 handle_late_or_duplicate(pkt); // negative delta

Rules of thumb

  • Late window: accept up to N late packets (e.g., N=3) if re-sequencing is cheap.
  • Duplicate: drop and count—don’t punish the link.
  • Huge jump: trigger fast resync (see §5).
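
The rules of thumb above can be sketched in Python, where integers do not wrap on their own and the 16-bit arithmetic has to be spelled out. The function names, the `seen` set of recently accepted sequence numbers, and the `max_jump` threshold are illustrative assumptions, not part of the original design.

```python
def seq_delta(seq, expect_seq):
    """Signed distance from expect_seq to seq on a 16-bit circle, in [-32768, 32767]."""
    return ((seq - expect_seq + 0x8000) & 0xFFFF) - 0x8000


def order_action(seq, expect_seq, seen, late_window=3, max_jump=1000):
    """Decide what to do with a packet that already passed integrity checks."""
    d = seq_delta(seq, expect_seq)
    if d == 0:
        return "in_order"
    if abs(d) > max_jump:
        return "resync"          # absurd delta: do not try to repair the past
    if d > 0:
        return "gap"             # expect_seq .. seq-1 have not arrived yet
    if seq in seen:
        return "duplicate"       # drop and count
    return "late_accept" if -d <= late_window else "late_drop"
```

For example, `seq_delta(2, 65534)` evaluates to 4, so a stream that wraps past 65535 is treated as a small forward step rather than a huge backward one.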

3) Jitter Buffer That Hides Human-Visible Stalls

A small time-aligned buffer smooths bursts without hurting latency.

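// Pseudocode
const int DEPTH_MS = 80;   // tune via P95 inter-arrival jitter
JitterBuffer jb(DEPTH_MS);

on_packet(pkt) {
  jb.push(pkt.ts, pkt.payload);   // pkt.ts: sample timestamp, assumed stamped at capture or arrival
  drain();
}

on_playout_tick() { drain(); }    // timer-driven, so playout continues when packets stop arriving

drain() {
  while (jb.ready(now_ms())) {
    emit_to_consumer(jb.pop());
  }
}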

Sizing (start simple):
depth_ms ≈ p95_jitter_ms + decode_budget_ms + safety_margin_ms (about 10 ms)

If you can’t measure yet, begin with 60–100 ms, then track p95 and shrink as transport stabilizes.
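
A small runnable sketch of both ideas, assuming timestamps and `now_ms` share one clock and that jitter samples are collected elsewhere. `suggest_depth_ms` and this `JitterBuffer` class are hypothetical helpers written for illustration; the percentile uses the simple nearest-rank method.

```python
import heapq
import itertools
import math


def suggest_depth_ms(jitter_samples_ms, decode_budget_ms, safety_ms=10):
    """depth = p95 jitter + decode budget + safety margin (needs a non-empty sample list)."""
    ordered = sorted(jitter_samples_ms)
    rank = max(1, math.ceil(95 * len(ordered) / 100))   # nearest-rank p95
    return ordered[rank - 1] + decode_budget_ms + safety_ms


class JitterBuffer:
    """Holds samples until they are depth_ms old, then releases them in timestamp order."""

    def __init__(self, depth_ms):
        self.depth_ms = depth_ms
        self._heap = []
        self._order = itertools.count()   # tie-breaker for equal timestamps

    def push(self, ts_ms, payload):
        heapq.heappush(self._heap, (ts_ms, next(self._order), payload))

    def ready(self, now_ms):
        return bool(self._heap) and now_ms - self._heap[0][0] >= self.depth_ms

    def pop(self):
        ts_ms, _, payload = heapq.heappop(self._heap)
        return ts_ms, payload
```

Because the heap orders by timestamp, a packet that arrives slightly out of order is re-sequenced for free as long as it lands before its playout deadline.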


4) Concealment That Fits the Signal

Corruption creates gaps. Fill them differently per modality:

  • Scalar time-series (weight, temperature, heart rate):
    • 1-sample gap: linear interpolate
    • 2–3 samples: slope-limited interpolation
    • >3 samples: hold-last + flag degraded
# linear interpolate one missing scalar sample
def conceal(prev_val, next_val):
    return prev_val + 0.5*(next_val - prev_val)
  • Vector IMU streams: clamp velocity, interpolate per axis, median-of-3 to kill spikes.
  • Audio/continuous: packet-loss concealment (PLC) with short time-stretch or noise fill; avoid audible clicks.
  • Categorical events (labels): last-valid-wins with decay; if decay exceeds T, emit “unknown”.

Golden rule: concealments are annotated (a flag bit) so downstream analytics can ignore or weight them less.
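
The scalar policy above, including the annotation rule, might look like the following sketch. The tuple shape `(value, concealed, degraded)`, the `max_step` slope limit, and the function name are assumptions made for the example; the thresholds mirror the 1 / 2–3 / >3 sample tiers.

```python
def conceal_gap(prev_val, next_val, gap, max_step=None, max_span=3):
    """Return one (value, concealed, degraded) tuple per missing sample."""
    if gap <= 0:
        return []
    if gap > max_span:
        # Too long to interpolate honestly: hold the last value and say so.
        return [(prev_val, True, True)] * gap
    step = (next_val - prev_val) / (gap + 1)
    if gap >= 2 and max_step is not None:
        step = max(-max_step, min(max_step, step))   # slope-limited interpolation
    return [(prev_val + step * (i + 1), True, False) for i in range(gap)]
```

Every returned sample carries `concealed=True`, so a downstream consumer can filter or down-weight patched values without guessing.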


5) Resync Without Freezing the UI

Prefer selective repair that doesn’t block playout:

  • NACK/SACK: Ask only for what’s missing within a small window.
  • Parity FEC (XOR across a short block): Recover one lost packet per block without a round-trip.
  • Keyframes/resync beacons: Jump-to-now when repairing the past is too costly.

Tiny FEC XOR (block of 4 + 1 parity)

// Sender: build parity for packets k..k+3
uint8_t parity[PAYLOAD] = {0};
for (int i = 0; i < 4; i++) xor_bytes(parity, pkt[k+i].payload, PAYLOAD);
send_packet(PARITY_FLAG, parity);

// Receiver: if exactly one data packet is missing and the parity packet arrived, reconstruct
if (missing_count == 1) {
  uint8_t rec[PAYLOAD] = {0};
  xor_bytes(rec, parity, PAYLOAD);
  for (each received r in block) xor_bytes(rec, r.payload, PAYLOAD);
  deliver(rec);
}

When to NACK vs FEC?

  • Short RTT, light loss → NACK (lower overhead).
  • Bursty loss or higher RTT → small-block FEC (no UI stall).
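
For a version of the XOR parity scheme you can run on a desk, here is a Python sketch. It assumes fixed-length payloads and represents a lost payload as `None`; `make_parity` and `recover_block` are illustrative names.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(payloads):
    """Sender side: XOR all payloads in the block into one parity payload."""
    parity = bytes(len(payloads[0]))
    for p in payloads:
        parity = xor_bytes(parity, p)
    return parity


def recover_block(received, parity):
    """Receiver side: repair at most one lost payload (marked None).

    Returns the full block, or None when the loss exceeds what parity can fix.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if not missing:
        return list(received)
    if len(missing) > 1 or parity is None:
        return None                      # fall back to NACK or concealment
    rec = parity
    for p in received:
        if p is not None:
            rec = xor_bytes(rec, p)      # XOR out everything that did arrive
    repaired = list(received)
    repaired[missing[0]] = rec
    return repaired
```

A single parity packet only repairs one loss per block, which is why block size matters: shorter blocks cost more overhead but survive denser loss.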

6) Timers & Backoff That Don’t Kill Battery

  • Immediate local actions (detect/decide/conceal) are synchronous and cheap.
  • Network repair uses exponential backoff with jitter to avoid thundering herds:
// After sending a NACK and not hearing back
retry_ms = min(retry_ms * 2, 1000);   // exponential growth, capped at 1 s
delay_ms = jitter(retry_ms, 0.2);     // ±20% jitter on the scheduled delay only, so it never compounds
schedule(delay_ms);
  • Guardrails: cap retries per window, auto-promote to RESYNC after K attempts, and never block playout.
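
A sketch of the retry policy with its guardrails, under the assumption that returning `None` is how the caller learns it should promote to RESYNC. The class name and default values are illustrative; the base delay of 50 ms matches the first entry of the backoff list in the config later in this post.

```python
import random


class NackBackoff:
    """Capped exponential backoff with symmetric jitter and a retry limit."""

    def __init__(self, base_ms=50, cap_ms=1000, jitter=0.2, max_retries=3, rng=None):
        self.base_ms = base_ms
        self.cap_ms = cap_ms
        self.jitter = jitter
        self.max_retries = max_retries
        self.rng = rng or random.Random()
        self.attempts = 0

    def next_delay_ms(self):
        """Delay before the next NACK, or None when retries are exhausted."""
        if self.attempts >= self.max_retries:
            return None                                   # caller promotes to RESYNC
        nominal = min(self.base_ms * (2 ** self.attempts), self.cap_ms)
        self.attempts += 1
        spread = nominal * self.jitter
        return nominal + self.rng.uniform(-spread, spread)  # jitter is not stored

    def reset(self):
        """Call once the gap is repaired."""
        self.attempts = 0
```

Keeping the nominal delay separate from the jittered one means the randomness never accumulates across retries.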

7) Observability: Prove You’re Hiding the Pain

Expose counters and percentiles:

  • packets_total, bad_crc, late, duplicate, gaps_filled, fec_repairs, nacks_sent, resyncs
  • p95_interarrival_ms, p99_jitter_ms, stall_time_ms
  • Experience SLOs:
    • Glitch-free minutes per session
    • Time-to-steady after link flap (p95)
    • Concealment ratio (how often you’re patching)
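
One way to hold these metrics in memory is sketched below. The counter names follow the list above; the `RecoveryStats` class, the nearest-rank percentile, and the definition of concealment ratio as patched samples over all delivered samples are assumptions for illustration.

```python
import math
from collections import Counter


class RecoveryStats:
    def __init__(self):
        self.counters = Counter()
        self.interarrival_ms = []

    def count(self, name, n=1):
        self.counters[name] += n

    def observe_interarrival(self, ms):
        self.interarrival_ms.append(ms)

    def percentile(self, q):
        """Nearest-rank percentile of inter-arrival times; q is an integer 1..100."""
        if not self.interarrival_ms:
            return None
        ordered = sorted(self.interarrival_ms)
        rank = max(1, math.ceil(q * len(ordered) / 100))
        return ordered[rank - 1]

    def concealment_ratio(self):
        """Share of delivered samples that were patched rather than received."""
        delivered = self.counters["packets_total"] + self.counters["gaps_filled"]
        return self.counters["gaps_filled"] / delivered if delivered else 0.0
```

On a constrained device you would likely replace the unbounded list with a fixed-size histogram, but the reported numbers stay the same in spirit.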

8) Idempotent Aggregation (Cloud or App Core)

When the same sequence number shows up twice (due to a retransmit or late arrival), an upsert keyed on (stream_id, seq) keeps storage clean. Derived roll-ups (e.g., per-minute stats) are computed only on confirmed samples (concealed=false) unless concealed values are explicitly allowed.

Idempotent insert (pseudo-SQL):

INSERT INTO samples(stream_id, seq, payload, concealed)
VALUES(:s, :q, :p, :c)
ON CONFLICT(stream_id, seq) DO UPDATE
SET payload=excluded.payload,
    concealed=excluded.concealed
-- a concealed guess must never overwrite a confirmed payload
WHERE excluded.concealed <= samples.concealed;
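
To try the idempotent insert locally, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. It assumes SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE`), stores `concealed` as 0/1, and adds a guard so a concealed value arriving late cannot replace a confirmed one; the table layout and helper names are illustrative.

```python
import sqlite3

UPSERT = """
INSERT INTO samples(stream_id, seq, payload, concealed)
VALUES(?, ?, ?, ?)
ON CONFLICT(stream_id, seq) DO UPDATE
SET payload = excluded.payload, concealed = excluded.concealed
WHERE excluded.concealed <= samples.concealed
"""


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE samples("
        "stream_id TEXT, seq INTEGER, payload BLOB, concealed INTEGER, "
        "PRIMARY KEY(stream_id, seq))"
    )
    return db


def upsert(db, stream_id, seq, payload, concealed):
    """Insert a sample; on a repeat, keep whichever version is more trustworthy."""
    db.execute(UPSERT, (stream_id, seq, payload, int(concealed)))


def fetch(db, stream_id, seq):
    return db.execute(
        "SELECT payload, concealed FROM samples WHERE stream_id=? AND seq=?",
        (stream_id, seq),
    ).fetchone()
```

Replaying the same packet any number of times leaves exactly one row, which is the property that makes retransmission safe for downstream roll-ups.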

9) Latency Budget & Configuration Knobs

Define a hard budget from wire to UI:

Total budget (ms) = transport_jitter_p95 + decode + jitter_buffer + UI_queue
Target: < 150 ms for “feels live” streams

Pragmatic defaults

  • jitter_buffer_depth_ms: 80 (start); shrink toward 40 as link stabilizes.
  • late_accept_window: 3 packets.
  • fec_block: 4+1 parity for bursty links; off for very low loss.
  • resync_after: 3 NACK retries or a 250 ms gap, whichever comes first.
  • concealment_max_span: 3 samples; else mark degraded.

Runtime config (YAML-like):

recovery:
  jitter_ms: 80
  late_window: 3
  fec:
    enabled: true
    block: 4
  nack:
    max_retries: 3
    backoff_ms: [50, 100, 200, 400]
  resync:
    keyframe_interval_ms: 1000
  conceal:
    max_span: 3
    policy: scalar_linear   # one of: scalar_linear, vector_median, audio_plc
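
A config like this is easy to get subtly wrong, so a small sanity check at startup can pay for itself. The sketch below mirrors the knobs above in a hypothetical `RecoveryConfig` dataclass and checks them against the 150 ms latency budget; the field names and the specific checks are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class RecoveryConfig:
    jitter_ms: int = 80
    late_window: int = 3
    fec_block: int = 4
    nack_max_retries: int = 3
    nack_backoff_ms: tuple = (50, 100, 200, 400)
    keyframe_interval_ms: int = 1000
    conceal_max_span: int = 3

    def problems(self, transport_p95_ms, decode_ms, ui_queue_ms, budget_ms=150):
        """Return a list of human-readable issues; empty means the config is sane."""
        issues = []
        total = transport_p95_ms + decode_ms + self.jitter_ms + ui_queue_ms
        if total >= budget_ms:
            issues.append(f"latency {total} ms exceeds budget {budget_ms} ms")
        if len(self.nack_backoff_ms) < self.nack_max_retries:
            issues.append("fewer backoff steps than retries")
        if list(self.nack_backoff_ms) != sorted(self.nack_backoff_ms):
            issues.append("backoff steps must be non-decreasing")
        return issues
```

Running `problems()` with measured numbers from the field turns the latency budget from a slide-deck figure into a check that fails loudly.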

10) Threads, Buffers, and Zero-Copy

  • Single-producer single-consumer (SPSC) ring buffers avoid locks between ISR/DMA producers and user-space consumers.
  • Fixed allocations: pre-allocate packet nodes to avoid heap churn; drop oldest on pressure.
  • Zero-copy CRC by pointing CRC engine at DMA’d regions; avoid memcpy just to validate.

SPSC ring (sketch):

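#include <stdatomic.h>

// RING_SIZE must be a power of two; one slot stays empty to tell full from empty.
static _Atomic uint32_t head, tail;   // head: written by producer, tail: by consumer
static Packet slots[RING_SIZE];

bool push(Packet p) {
  uint32_t h = atomic_load_explicit(&head, memory_order_relaxed);
  uint32_t n = (h + 1) & (RING_SIZE - 1);
  if (n == atomic_load_explicit(&tail, memory_order_acquire)) return false; // full
  slots[h] = p;
  atomic_store_explicit(&head, n, memory_order_release); // publish only after the slot is written
  return true;
}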

11) Transport-Specific Notes (Generic)

  • Connection churn: after a disconnect, wait for kernel/stack cleanup before reconnect attempts; use short randomized delays to reduce collisions.
  • Background modes: schedule repair work in permitted windows; never let repair timers block playout.
  • PHY/MTU changes: treat negotiated parameters as volatile; cache short-term, but handle downgrade without assuming stability.

12) Chaos Testing & Validation

Build a local fault injector:

  • Random bit flips and truncations (toggle CRC failures).
  • Burst loss (drop K of N consecutive).
  • Reorder windows (swap adjacent, delay by 1–3 positions).
  • Latency jitter shaped to match field p95/p99.
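
A deterministic injector is enough to get started. The sketch below covers bit flips, burst loss, and adjacent reordering over a list of byte-string frames; it takes a seeded `random.Random` so failures are reproducible. Function and parameter names are illustrative, and latency shaping is left out to keep it short.

```python
import random


def flip_bit(frame, rng):
    """Flip exactly one randomly chosen bit in the frame."""
    data = bytearray(frame)
    bit = rng.randrange(len(data) * 8)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)


def inject_faults(frames, rng, flip_prob=0.0, burst_every=0, burst_len=0, swap_prob=0.0):
    """Return a damaged copy of `frames`.

    burst_every/burst_len: drop the first burst_len frames of every burst_every.
    swap_prob: chance of swapping each adjacent pair (reordering by one position).
    """
    out = []
    for i, frame in enumerate(frames):
        if burst_every and i % burst_every < burst_len:
            continue                                  # burst loss
        if rng.random() < flip_prob:
            frame = flip_bit(frame, rng)              # should fail the CRC downstream
        out.append(frame)
    i = 0
    while i + 1 < len(out):
        if rng.random() < swap_prob:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2                                    # do not move the same frame twice
        else:
            i += 1
    return out
```

Feeding the output into the receive path and asserting on the counters from the observability section gives a repeatable regression test for each fault type.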

Acceptance criteria

  • Stall time per minute < threshold (e.g., < 200 ms).
  • Glitch-free minutes/session ↑ vs baseline.
  • Concealment ratio stays within guardrails (e.g., < 8% of samples).
  • No unbounded growth in retry counter or jitter depth.

13) Backpressure & Flow Control

When consumers fall behind:

  • Drop strategy: prefer dropping oldest unconsumed rather than newest; degrade gracefully.
  • Signal upstream with a watermark: when buffer > 70%, request lower rate or more keyframes.
  • Protect UI thread: never block the render path; emit “degraded” state instead.
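
The drop-oldest and watermark rules fit in a few lines. This sketch assumes a single-threaded consumer loop and uses an integer percentage for the watermark to avoid floating-point edge cases; `BoundedQueue` and its return convention are hypothetical.

```python
from collections import deque


class BoundedQueue:
    """Fixed-capacity queue that drops the oldest item and reports pressure."""

    def __init__(self, capacity, high_watermark_pct=70):
        self.capacity = capacity
        self.high_watermark_pct = high_watermark_pct
        self._items = deque()
        self.dropped = 0

    def push(self, item):
        """Add an item. Returns True when upstream should slow down."""
        if len(self._items) >= self.capacity:
            self._items.popleft()        # oldest data is the least useful to a live UI
            self.dropped += 1
        self._items.append(item)
        return len(self._items) * 100 > self.capacity * self.high_watermark_pct

    def pop(self):
        """Non-blocking: returns None when empty so the render path never waits."""
        return self._items.popleft() if self._items else None
```

The boolean returned by `push` is the hook for requesting a lower rate or more keyframes before the queue actually overflows.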

14) Security & Robustness Quick Notes

  • Treat schema as a gate: unknown schema → quarantine, don’t parse.
  • CRC failure → drop without side effects; don’t store.
  • Sequence math should be wrap-safe (uint16 wrap) and resistant to malicious jumps (promote to RESYNC on absurd deltas).
  • Log the first occurrence of each error type per session to reduce noise.

Results

Teams implementing this pattern commonly observe:

  • Near-zero visible stalls for short burst losses (≤3 packets) thanks to jitter buffers + concealment.
  • Fast recovery after roaming or RF spikes, as selective repair avoids full pipeline pauses.
  • Lower energy use than fixed-interval retries, because backoff + jitter prevents retry storms.
  • Cleaner analytics: downstream models can ignore flagged concealments, reducing false spikes.

How to validate

  • A/B test stall time per minute and glitch-free minutes.
  • Track concealment ratio (non-zero but not dominating).
  • Measure p95 time-to-steady after induced disconnects (toggle AP, attenuate RF).
  • Confirm no UI thread blocking and bounded memory under burst loss.

Takeaways

  • Put Detect → Decide → Conceal → Resync between transport and consumers.
  • Sequence + CRC make bad data cheap to spot; jitter buffers make bursts invisible.
  • Use NACK/SACK for quick holes; add small-block FEC for bursty links or higher RTT.
  • Conceal by modality (scalar, vector, audio) and always flag patched samples.
  • Keep retries bounded with exponential backoff + jitter; never block playout.
  • Instrument experience SLOs (glitch-free minutes, stall time, time-to-steady)—optimize for what users feel.

At Hoomanely, our mission is to turn raw pet signals into trustworthy, real-time wellness insights. Auto-recovery pipelines make the data boringly reliable, even when radios aren’t. Whether it’s activity streams, bowl weight changes, or ambient audio cues, this layer ensures our app and models see clean, continuous signals, so pet parents get confident guidance instead of jittery charts. The same primitives you’ve seen here—sequence-guarded packets, modality-aware concealment, and selective resync—power the perception of effortless reliability across our product surfaces.


Minimal Snippets You Can Reuse

Corruption classifier:

enum Class { GOOD, LATE, DUP, CORRUPT, MISSING };

Class classify(const Packet& pkt, uint16_t expect_seq) {
  if (!crc16_ok(pkt) || !len_ok(pkt) || !schema_ok(pkt.schema)) return CORRUPT;
  int16_t d = (int16_t)(pkt.seq - expect_seq);  // wrap-safe signed delta
  if (d == 0) return GOOD;
  if (d > 0) return MISSING;     // d packets (expect_seq .. pkt.seq-1) missing before this one
  return seen(pkt.seq) ? DUP : LATE;
}

Late-packet acceptance window (re-sequence ≤3):

if (classify(pkt, expect_seq) == LATE) {
  if (within_window(pkt.seq, expect_seq, 3)) insert_in_order(pkt);
  else                                       drop(pkt); // outside window
}

Resync trigger:

if (missing_span > MAX_WINDOW || retries > MAX_RETRIES) {
  send_resync_request();
  state = RESYNC;
}
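
The resync trigger is one transition in the small state machine mentioned in the Approach section. A possible shape for the whole thing is sketched below; the three state names, the `keyframe_seen` exit condition, and the default thresholds are assumptions chosen to match the snippets in this post.

```python
from enum import Enum


class State(Enum):
    STREAMING = "streaming"   # in order, nothing outstanding
    REPAIRING = "repairing"   # gap open, NACK/FEC in progress, playout continues
    RESYNC = "resync"         # gave up on the past, waiting for a keyframe


def next_state(state, missing_span, retries, keyframe_seen,
               max_window=8, max_retries=3):
    """Pure transition function: no I/O, so it is trivial to unit test."""
    if state is State.RESYNC:
        return State.STREAMING if keyframe_seen else State.RESYNC
    if missing_span > max_window or retries > max_retries:
        return State.RESYNC
    if missing_span > 0:
        return State.REPAIRING
    return State.STREAMING
```

Keeping the transition logic free of side effects lets the fault injector drive it directly, while the caller handles sending NACKs and resync requests.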

Invisible quality is built, not wished for. Put this layer between your transport and your users, and you’ll ship experiences that feel wired—even over the messiest radios.
