LittleFS Tail-Latency for Burst Writes: Commit + Prealloc

When you first validate a bursty image pipeline on LittleFS, it often looks great: average write times are low, and short tests feel “done.” Then you run a 6–12 hour soak, the flash creeps toward full, and suddenly P95/P99 write latency starts spiking. Frames arrive in bursts, your producer thread blocks longer than expected, and the rest of the system starts paying the price—sensor scheduling jitter, CAN/USB backpressure, dropped frames, and even watchdog risk.

This post is a practical firmware playbook for reducing that long-tail latency without replacing your storage pipeline. The key is to stop treating “write time” as one number and instead make the system explain itself: how much time is spent pushing bytes versus updating metadata, and when garbage collection or compaction quietly steals your latency budget as the device ages and free space shrinks.

From there, we apply two small patterns that change stability dramatically. A commit marker makes each image set “publishable” only when it’s fully persisted (and remains safe under power loss), and best-effort preallocation reduces allocation churn so burst writes don’t repeatedly trigger worst-case GC behavior. The result is a storage path that stays predictably fast—even when the flash is hot, fragmented, and nearly full.


The Problem

LittleFS is designed for embedded flash realities (wear leveling, power-loss resilience, log-structured behavior). That design can create a common tail-latency failure mode for burst workloads:

  • Bursty producers create many short-lived files (or frequently new extents).
  • As the filesystem fills, allocation becomes harder, and internal cleaning/compaction becomes more frequent.
  • Metadata operations (directory updates, rename, close, sync) can suddenly take much longer than the data write itself.
  • The result: average throughput looks fine, but the tail is unpredictable.

In image pipelines (camera + thermal frames + metadata sidecars), the important performance truth is: Your system doesn’t fail on averages. It fails when the worst 1–5% of writes stall the whole pipeline.


Approach

We’ll take an incremental approach that is safe, measurable, and compatible with existing pipelines:

Measure what matters (and separate it)

Track timing for:

  • Data write time (payload bytes)
  • Metadata time (open/close, directory operations, rename)
  • Sync time (if you explicitly fsync/sync)
  • GC/compaction signal (inferred from long metadata time, or from block-device counters)

Commit marker pattern for “set correctness”

Write image sets in a way that downstream readers never see partial sets—especially after power loss:

  • Rename-based commit (.tmp → final) or
  • Footer-based commit (write a verified footer last)

Best-effort preallocation to reduce churn

Avoid repeated allocation under pressure:

  • Keep a small file pool (reused slots)
  • Or reserve a write budget (space discipline)
  • Or use “pre-created empty containers” that you fill and commit

None of this requires changing LittleFS internals. It’s firmware-level engineering around tail behavior.


Process

Step 1: Add instrumentation that isolates tail spikes

If you only time the whole “write set” call, you’ll miss the real source of stalls. Instead, segment your write path into phases that map to filesystem work.

What to measure

For each image set:

  • t_open_us
  • t_write_us (sum of payload writes)
  • t_close_us
  • t_commit_us (rename or footer finalization)
  • t_total_us

Also record:

  • fs_used_pct (estimated)
  • set_size_bytes
  • “mode” (empty FS vs partially full vs near full)

Two metrics that keep you honest

  • P99 set commit latency (ms)
  • Sustained sets/min after 6 hours (sets/min)

That’s enough to validate stability without turning your post into a dashboard.
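To compute P99 on-device without storing every sample, a small log2-bucket histogram is enough. A minimal sketch (the bucket count and spacing are illustrative, not a littlefs requirement):

```c
#include <stdint.h>

/* Log2-spaced latency histogram: bucket b holds samples in
 * [2^b, 2^(b+1)) microseconds (bucket 0 also catches 0–1 us).
 * 24 buckets cover up to ~16 seconds. */
#define LAT_BUCKETS 24

typedef struct {
    uint32_t counts[LAT_BUCKETS];
    uint32_t total;
} lat_hist_t;

static void lat_hist_add(lat_hist_t *h, uint32_t us) {
    int b = 0;
    while ((us >>= 1) && b < LAT_BUCKETS - 1)
        b++;                      /* b = floor(log2(us)) */
    h->counts[b]++;
    h->total++;
}

/* Returns an upper bound (the bucket ceiling, in us) for percentile pct. */
static uint32_t lat_hist_pct(const lat_hist_t *h, uint32_t pct) {
    if (h->total == 0) return 0;
    uint64_t target = (uint64_t)h->total * pct / 100;
    uint64_t seen = 0;
    for (int b = 0; b < LAT_BUCKETS; b++) {
        seen += h->counts[b];
        if (seen > target)
            return 2u << b;       /* ceiling of bucket b, i.e. 2^(b+1) */
    }
    return 2u << (LAT_BUCKETS - 1);
}
```

The resolution is coarse (power-of-two buckets), but that is exactly what you want for tail analysis: it distinguishes “~100 µs” from “~100 ms” without any per-sample storage.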

Implementation sketch (minimal overhead)

typedef struct {
  uint32_t open_us, write_us, close_us, commit_us, total_us;
  uint32_t bytes;
  uint16_t fs_used_pct;
} write_trace_t;

static inline uint32_t now_us(void); // platform timer (e.g. DWT cycle counter or a 1 MHz hardware timer)

#define TRACE_START(var) uint32_t var##_t0 = now_us()
#define TRACE_END(var, out) do { (out) += (now_us() - var##_t0); } while(0)

Use it like:

// (return-code checks elided for clarity; check every lfs_* call in production)
write_trace_t tr = {0};
TRACE_START(total);

TRACE_START(open);
lfs_file_open(&lfs, &f, path_tmp, LFS_O_WRONLY | LFS_O_CREAT | LFS_O_TRUNC);
TRACE_END(open, tr.open_us);

TRACE_START(write);
lfs_file_write(&lfs, &f, buf, len);
TRACE_END(write, tr.write_us);

TRACE_START(close);
lfs_file_close(&lfs, &f);
TRACE_END(close, tr.close_us);

// commit marker step (rename or footer verify)
TRACE_START(commit);
commit_file(path_tmp, path_final);
TRACE_END(commit, tr.commit_us);

TRACE_END(total, tr.total_us);

How to estimate “fill level”

If you already track blocks used at the block-device layer, use that. Otherwise:

  • approximate from lfs_fs_size() (if available in your integration) and your configured block count
  • or maintain a coarse allocator watermark in your own storage service

You don’t need perfect accuracy—just enough to correlate spikes with “near full.”
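The `lfs_fs_size()` route reduces to one division. A sketch (the helper name is mine; `lfs_fs_size` returns allocated blocks, negative on error, and is documented as best-effort, so treat the result as a trend signal rather than an exact gauge):

```c
#include <stdint.h>

/* Coarse fill estimate. used_blocks would come from lfs_fs_size(&lfs);
 * block_count is your configured lfs_config.block_count. */
static uint16_t fs_used_pct_from_blocks(int32_t used_blocks, uint32_t block_count) {
    if (used_blocks < 0 || block_count == 0)
        return 0; /* error / unknown: caller should not trust this sample */
    uint32_t pct = (uint32_t)((uint64_t)used_blocks * 100u / block_count);
    return (uint16_t)(pct > 100 ? 100 : pct);
}
```

Sample this once per set (or once per burst), not per write; it is for correlating spikes with fill level, not for allocation decisions.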



Step 2: Use a commit marker so readers never see partial sets

Even if you optimize latency, you still need correctness: readers should only consume complete image sets. Tail latency and correctness are tied—if your pipeline retries or the device reboots mid-burst, you don’t want partially written files to be interpreted as “real data.”

Two patterns work well in LittleFS-based systems:

Option A: Rename-based commit

Write everything to a temporary name, then rename to final name when complete:

  • set_1234.tmp/
  • commit → set_1234/ (or set_1234.done marker)

Why rename works

Renames are typically metadata operations that LittleFS handles in a power-loss-safe way. The key benefit is semantic: downstream readers only scan final names.

Practical structure

  • Write image payload files to a temp directory:
    sets/set_1234.tmp/cam.bin, sets/set_1234.tmp/therm.bin, sets/set_1234.tmp/meta.json
  • After all closes succeed, commit:
    rename directory set_1234.tmp → set_1234
  • Optionally create a COMMIT file inside final dir for easy scanning.

Reader rule

  • Ignore .tmp directories entirely
  • Only process set_* (final)
  • If you see set_* but missing expected files, treat as corrupted and quarantine.
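The first two rules reduce to a name filter at scan time. A sketch, assuming the `set_` prefix and `.tmp` suffix naming used above:

```c
#include <string.h>
#include <stdbool.h>

/* Reader-side filter: only final "set_*" names are eligible. Anything
 * still carrying the ".tmp" suffix is an uncommitted set and is skipped. */
static bool is_published_set(const char *name) {
    if (strncmp(name, "set_", 4) != 0)
        return false;
    size_t n = strlen(name);
    return !(n >= 4 && strcmp(name + n - 4, ".tmp") == 0);
}
```

The quarantine rule (final name present but files missing) still needs a content check after this filter passes.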

Commit function sketch

int commit_dir(const char* tmp, const char* final) {
  // Ensure all files are closed before this point.
  // Rename is your atomic-ish "publish" step.
  int rc = lfs_rename(&lfs, tmp, final);
  return rc;
}

Option B: Footer-based commit

If rename/dir operations are your tail spikes, you can avoid rename and instead make the file self-validating:

  • Append a fixed footer at the end: {magic, version, length, crc32}
  • Only consider the set valid if footer exists and verifies.

This works well when you write a single “container file” per set:

  • set_1234.bin contains camera + thermal + metadata + footer
  • The footer is written last; if power is lost, verification fails and the reader skips it.

Footer layout example

  • magic = 0x53455421 (“SET!”)
  • payload_len
  • crc32(payload)

Reader rule

  • Scan file, read footer, verify CRC, then consume.
  • If invalid footer, skip.
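The footer layout above can be sketched as a small struct plus seal/verify helpers. These operate on an in-memory container buffer; in practice you would write/read the same bytes through littlefs. The CRC here is the standard reflected CRC-32 (IEEE polynomial), table-free and slow, which is fine at commit frequency:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SET_MAGIC 0x53455421u /* "SET!" */

typedef struct {
    uint32_t magic;
    uint32_t payload_len;
    uint32_t crc32;       /* CRC of payload bytes only */
} set_footer_t;

/* Bitwise CRC-32 (IEEE, reflected). */
static uint32_t crc32_ieee(const uint8_t *p, size_t n) {
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        c ^= p[i];
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(c & 1));
    }
    return c ^ 0xFFFFFFFFu;
}

/* Append footer after the payload in buf; returns total container length.
 * This is the "written last" step: until it lands, the set is invalid. */
static size_t footer_seal(uint8_t *buf, size_t payload_len) {
    set_footer_t f = { SET_MAGIC, (uint32_t)payload_len,
                       crc32_ieee(buf, payload_len) };
    memcpy(buf + payload_len, &f, sizeof f);
    return payload_len + sizeof f;
}

/* Reader rule: valid only if magic, length, and CRC all check out. */
static int footer_verify(const uint8_t *buf, size_t total_len) {
    if (total_len < sizeof(set_footer_t)) return 0;
    set_footer_t f;
    memcpy(&f, buf + total_len - sizeof f, sizeof f);
    if (f.magic != SET_MAGIC) return 0;
    if (f.payload_len != total_len - sizeof f) return 0;
    return crc32_ieee(buf, f.payload_len) == f.crc32;
}
```

A torn write anywhere (payload or footer) makes verification fail, which is exactly the power-loss behavior you want: the reader silently skips the set.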

In image-heavy products like EverBowl, commit semantics are what let downstream transfer/analytics stay simple: readers never need “partial write heuristics.” The pipeline can treat storage as a sequence of published sets, even across reboots.


Step 3: Best-effort preallocation to reduce allocation churn

Now we attack the main source of tail spikes in long runs: allocation/compaction pressure.

You don’t need perfect preallocation. You need enough to avoid “panic allocation” under near-full conditions.

Strategy 1: Reusable file pool (most practical)

Instead of creating brand-new filenames forever, create a bounded pool and reuse slots.

Example:

  • Pool of N set containers: slot_0000 … slot_0255
  • A small index file maps sequence_id -> slot_id
  • Each slot is overwritten in a controlled way (like a ring buffer)

Why it helps

  • Fewer directory entries changing over time
  • Less allocation churn
  • Predictable metadata patterns
  • GC pressure becomes smoother

Implementation idea

  • Pre-create the slot files once (at format time or first boot)
  • On each set, pick next slot, write there, commit with footer (or rename within slot namespace)
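The slot mapping itself is trivial once the pool exists. A sketch (the pool size, `sets/slot_XXXX` naming, and helpers are illustrative; the slot files would be pre-created once at first boot):

```c
#include <stdint.h>
#include <stdio.h>

#define SLOT_COUNT 256

typedef struct {
    uint32_t next_seq; /* monotonically increasing set id */
} slot_pool_t;

/* Ring-buffer mapping: seq -> reusable slot. The seq -> slot pairing is
 * also recorded in a small index file so readers can order sets. */
static uint32_t slot_for_seq(uint32_t seq) {
    return seq % SLOT_COUNT;
}

static void slot_path(char *out, size_t cap, uint32_t seq) {
    snprintf(out, cap, "sets/slot_%04u", (unsigned)slot_for_seq(seq));
}

/* Claim the next sequence id; the caller overwrites that slot's file. */
static uint32_t slot_claim(slot_pool_t *p) {
    return p->next_seq++;
}
```

Because filenames never change, directory metadata stays stable over months of operation; only file contents and the small index churn.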

Strategy 2: Reserved space discipline (simple and effective)

Tail spikes get brutal when you’re writing into the last few percent of free space.

Introduce a rule:

  • Maintain X% reserved free (e.g., 5–10%)
  • If below reserve, switch to “degraded mode”:
    • drop optional frames
    • reduce burst depth
    • prioritize commit markers
    • or pause writes until upload/transfer frees space

This is not “giving up.” It’s enforcing predictability. Stable systems protect their future self.
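The reserve rule is a one-line check at the top of the write path. A sketch (the thresholds are illustrative; tune them against your soak-test traces):

```c
#include <stdint.h>

typedef enum { WR_NORMAL, WR_DEGRADED, WR_PAUSED } write_mode_t;

/* Space discipline: never write into the last few percent of flash.
 * 90%/95% here are placeholders, not recommendations. */
static write_mode_t write_mode_for(uint16_t fs_used_pct) {
    if (fs_used_pct >= 95) return WR_PAUSED;   /* wait for upload/cleanup */
    if (fs_used_pct >= 90) return WR_DEGRADED; /* drop optional frames, shrink bursts */
    return WR_NORMAL;
}
```

Feed it the same fill estimate you are already logging for instrumentation, so the mode transitions show up in your traces.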

Strategy 3: Pre-created directories for burst batches

If your burst writes create many directories, pre-create a limited number of batch directories:

  • batch_00 … batch_15
  • Write sets into the current batch until it’s “sealed,” then rotate.

This reduces directory-create bursts at the worst times.


Implementation detail: “best-effort” means you never block forever

The whole point of preallocation is to smooth tail latency, so it can’t be allowed to become a new source of stalls. In other words, preallocation is an optimization path, not a dependency. Your write pipeline must still make forward progress even when the filesystem is under pressure, the pool is fragmented, or a slot can’t be claimed quickly.

Treat preallocation like a bounded fast-path:

  • Attempt to allocate / pick a slot quickly using a strict time budget (or a small bounded retry count). If the pool is healthy, this stays cheap and predictable.
  • If it fails, immediately fall back to the normal write path (create/allocate as usual). You may take a latency hit in that case, but the system remains correct and doesn’t deadlock or starve the producer.
  • Record a metric whenever the fast-path misses—for example prealloc_hit_rate, prealloc_fallback_count, and optionally fallback_reason (no free slot, slot busy, reserve threshold reached). This turns “it felt slower today” into an actionable signal: you can tell whether the pool is undersized, whether cleanup is lagging, or whether you’re routinely operating too close to full.
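The rules above can be sketched as a bounded fast-path. This is a toy model (the pool struct and metric names are placeholders); the point is the bounded retry plus the unconditional fallback:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int free_slots;             /* slots immediately claimable */
    uint32_t prealloc_hit;      /* fast-path successes */
    uint32_t prealloc_fallback; /* misses -> normal allocation path */
} prealloc_pool_t;

/* Bounded fast-path: a few cheap, non-blocking attempts, then give up
 * and let the caller take the ordinary create/allocate path. The loop
 * never spins forever, so preallocation can't starve the producer. */
static bool acquire_write_target(prealloc_pool_t *p, int max_attempts,
                                 uint32_t *slot_out) {
    for (int i = 0; i < max_attempts; i++) {
        if (p->free_slots > 0) {
            *slot_out = (uint32_t)--p->free_slots;
            p->prealloc_hit++;
            return true;
        }
        /* a real pool might yield briefly here, inside a strict time budget */
    }
    p->prealloc_fallback++;
    return false; /* caller falls back to the normal (slower) write path */
}
```

The hit/fallback counters are the observability hook: a falling `prealloc_hit` ratio over a soak run tells you the pool is undersized or cleanup is lagging before the latency numbers do.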

In production pipelines where image sets are uploaded off-device (e.g., for ML validation or post-processing), preallocation makes the on-device behavior boringly consistent during long soak runs—exactly what you want when devices are deployed in homes and you can’t babysit flash state.


Results: How you validate improvements without vanity benchmarks

This is where teams often get stuck: they run a 2-minute test, declare victory, and ship. Tail latency requires a different validation style.

Soak test setup (repeatable and meaningful)

Run a long-duration write test that matches your real burst pattern:

  • Same set size distribution (camera + thermal + metadata)
  • Same burst cadence (e.g., 3 FPS bursts, then idle)
  • Same downstream interactions (if your reader scans storage)

Test at three fill levels:

  • Freshly formatted
  • Mid-fill (50–70%)
  • Near full (85–95%)

What you should expect after these changes

  • Commit correctness becomes deterministic (reader never sees partial sets).
  • P99 commit latency becomes much tighter, especially near full.
  • Overall throughput may remain similar, but the system becomes stable under stress.

If you want one clean target to track:

  • Improve P99 set commit latency by 3–5× under near-full conditions (typical outcome when allocation churn is the real culprit).

(Exact numbers depend on block size, wear level, and your burst size, but “multi-x tail improvement” is common when you stop fighting the allocator at the worst time.)


Common issues and how to avoid them

Rename is expensive on my device

Use footer commit with container files, or reduce rename frequency by committing per batch instead of per file.

My reader still sees weird artifacts

Enforce strict reader rules:

  • Only read published names OR verified footers
  • Quarantine invalid sets
  • Run a lightweight cleanup task that deletes .tmp artifacts on boot

Preallocation made it worse

That usually means:

  • You preallocated too aggressively (causing a big upfront stall)
  • Or your pool is too large and increases metadata load

Fix:

  • Start small (e.g., pool size that covers 2–5 minutes of worst-case burst)
  • Grow only if your traces show you need it

We can’t afford sync latency

Don’t sync after every write unless you must. Instead:

  • rely on commit markers for correctness
  • group sync at safe points (end of burst, or periodic)

Correctness should come from commit semantics—not from “sync everything all the time.”
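A grouped-sync policy can be as small as one predicate. A sketch (the bound of 8 unsynced sets is illustrative; correctness still comes from the commit markers, and sync only limits how much buffered state is at risk):

```c
#include <stdint.h>
#include <stdbool.h>

/* Pay the sync cost at a safe point: end of a burst, or after a bounded
 * number of unsynced sets, whichever comes first. */
static bool should_sync(uint32_t sets_since_sync, bool burst_ended) {
    const uint32_t max_unsynced_sets = 8; /* illustrative bound */
    return burst_ended || sets_since_sync >= max_unsynced_sets;
}
```

Call it after each committed set; when it returns true, issue one sync and reset the counter, instead of syncing inside the hot write loop.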


Key Takeaways

  • Tail latency is usually metadata + allocation + GC, not raw data writes.
  • Start by instrumenting phases so you can see what’s actually spiking.
  • Use a commit marker (rename-based or footer-based) so readers never consume partial sets—even after power loss.
  • Apply best-effort preallocation (slot pools + reserved free space) to reduce churn and smooth long-run behavior.
  • Validate with soak tests at near-full conditions and track one or two real metrics (like P99 commit latency + sustained sets/min).
