Crash-Safe Flash Buffers: Surviving a Dying Battery

A collar-worn pet tracker lives in about the most hostile environment a storage engineer could design on purpose. The battery is tiny, the dog is unpredictable, and the wireless link to the hub drops every time the dog wanders behind the couch. Now picture the worst possible moment: the battery dies in the exact millisecond the firmware is writing a sensor frame to flash. Do we lose the whole file? Panic on the next boot? Silently replay garbage into the pet's health timeline? At Hoomanely, none of those are acceptable answers. This post is about the durability layer beneath our sensor queue — the part that decides what survives an outage, what survives a torn write, and, just as importantly, what we deliberately throw away.

The Problem: Two Different Ways to Lose Data

A continuous health monitor has to defend against two failure modes that look similar but are not.

The first is an outage. The wireless link to the hub goes down, but the inertial sensor keeps producing motion data at a steady rate. That data has nowhere to go in real time, so it has to land somewhere persistent until the link returns — otherwise every walk behind the couch becomes a hole in the activity record.

The second is a torn write. Flash sits behind a FAT filesystem, and FAT was never designed for a power supply that can vanish mid-operation. If the battery dies halfway through appending a frame, the file can end with a half-written record — a length field with no payload, or a payload that runs off the end of the file. Read that back naively and you either crash or, worse, feed corrupted samples into the pet's timeline.

Two more constraints make it harder. Flash capacity is small and shared with other data, so the buffer cannot grow without bound. And the persistence path must never block the real-time capture loop — a sensor that stalls waiting on a slow flash write is a sensor that drops samples.

The Approach: A Self-Describing Spill File

Our tracker uses a two-tier queue: a small in-RAM ring for low-latency staging, and a flash spill file that is the real persistence layer. The RAM tier and the drain logic are a story for another post. Here the focus is the flash file and why it is built to be corruption-tolerant by design.

The core idea is that the spill file must be self-describing. We refuse to keep a separate index or offset table, because that index becomes a second thing that can get out of sync with the data after a crash. Instead, every frame is wrapped with its length recorded both before and after the payload, so a reader can walk the file forward or backward without any external bookkeeping.
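
To make the layout concrete, here is a minimal host-side sketch in C. It is not our firmware code: frame_append, frame_prev, read_le16, and the in-memory "file" image are hypothetical names used only for illustration, and the two-byte little-endian tags are taken from the write path shown later in this post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout: [len lo][len hi][payload ...][len lo][len hi] */
#define FRAME_OVERHEAD 4u

static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Append one framed payload to an in-memory image of the file.
 * Returns the new end offset, or 0 if the frame is empty or does not fit. */
static size_t frame_append(uint8_t *file, size_t end, size_t cap,
                           const uint8_t *payload, uint16_t len)
{
    if (len == 0 || end > cap || cap - end < (size_t)len + FRAME_OVERHEAD) {
        return 0;
    }
    file[end]     = (uint8_t)(len & 0xFF);
    file[end + 1] = (uint8_t)(len >> 8);
    memcpy(&file[end + 2], payload, len);
    file[end + 2 + len] = (uint8_t)(len & 0xFF);
    file[end + 3 + len] = (uint8_t)(len >> 8);
    return end + len + FRAME_OVERHEAD;
}

/* Walk backward: given the offset just past a frame, use the trailing tag
 * to locate the frame's start. Returns 1 and sets *start on success, or 0
 * if the tags are inconsistent or there is no frame before `end`. */
static int frame_prev(const uint8_t *file, size_t end, size_t *start)
{
    if (end < FRAME_OVERHEAD) {
        return 0;
    }
    uint16_t tail = read_le16(&file[end - 2]);
    if (tail == 0 || (size_t)tail + FRAME_OVERHEAD > end) {
        return 0;
    }
    size_t s = end - tail - FRAME_OVERHEAD;
    if (read_le16(&file[s]) != tail) {
        return 0;   /* leading and trailing tags disagree */
    }
    *start = s;
    return 1;
}
```

Because both tags carry the same length, neither direction needs an index: a forward reader skips by the leading tag, and a backward reader skips by the trailing one.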

The file is append-only, capacity-bounded, and validated on read. Those three properties — append-only writes, a hard byte cap, and defensive parsing — are what turn "flash plus FAT" into something a battery-powered medical-adjacent device can actually trust.
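
The hard byte cap can be as simple as one comparison before each append. The sketch below is an assumption-laden illustration, not the shipped check: SPILL_CAP_BYTES and spill_has_room are hypothetical names, and the 64 KiB budget is a placeholder rather than our real figure. It also only answers whether a frame fits; what to do when it does not is a separate policy decision this post does not cover.

```c
#include <stdbool.h>
#include <stddef.h>

#define SPILL_CAP_BYTES (64u * 1024u)  /* hypothetical flash budget */
#define FRAME_OVERHEAD 4u              /* two 2-byte length tags */

/* Return true if a frame with `payload_len` payload bytes still fits under
 * the cap, given `spill_bytes` already written. The subtraction form avoids
 * overflow for any input values. */
static bool spill_has_room(size_t spill_bytes, size_t payload_len)
{
    if (spill_bytes > SPILL_CAP_BYTES || payload_len > SPILL_CAP_BYTES) {
        return false;
    }
    size_t need = payload_len + FRAME_OVERHEAD;
    return need <= SPILL_CAP_BYTES - spill_bytes;
}
```

Checking against the running byte counter, rather than asking the filesystem for the file size, keeps the test cheap enough to run on every frame.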

The Process: Four Decisions That Make It Safe

1. One write per frame, and count only what fully landed. Each frame is assembled in a single contiguous buffer — leading length tag, payload, trailing length tag — and pushed to flash in exactly one write() syscall. We advance our running byte counter only if the full frame was written.

    /* Leading tag: payload length, little-endian. */
    framed[0] = (uint8_t)(got & 0xFF);
    framed[1] = (uint8_t)((got >> 8) & 0xFF);
    memcpy(&framed[2], buf, got);
    /* Trailing tag repeats the length so the reader can cross-check it. */
    framed[2 + got]     = (uint8_t)(got & 0xFF);
    framed[2 + got + 1] = (uint8_t)((got >> 8) & 0xFF);
    /* One write for the whole frame; count it only if every byte landed. */
    if (write(s_spill_fd, framed, got + 4) == (ssize_t)(got + 4)) {
        s_spill_bytes += (long)(got + 4);
    }

The single-syscall write matters for more than speed. Three separate writes (tag, payload, tag) give a power cut three windows to tear a frame instead of one, and each FAT write grabs an internal lock — three per frame turned the spill task into the bottleneck under load. One write keeps the operation as close to atomic as the filesystem allows, and the dual tags let the reader later prove a frame is intact.

2. Flush before letting go. When the link comes back and the file changes hands from writer to reader, we don't just close the descriptor and hope the data reached the medium. We force it.

static void close_spill_fd(void)
{
    if (s_spill_fd >= 0) {
        fsync(s_spill_fd);
        close(s_spill_fd);
        s_spill_fd = -1;
    }
}

fsync() pushes any buffered data through to flash before the reader opens its own handle. Without it, the most recent frames could live only in a cache that a power cut would erase — present in our byte counter, absent on the medium.
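
One possible hardening, shown here only as a sketch: both fsync() and close() return a status, and a caller that wants to know whether the flush actually succeeded can surface it. flush_and_close is a hypothetical helper, not a function in our firmware, and what the caller does with a failure is left open.

```c
#include <stdbool.h>
#include <unistd.h>

/* Flush and close a descriptor. Returns true if there was nothing to close
 * or if both fsync() and close() reported success. *fd is reset to -1 in
 * every case so the handle is never reused by accident. */
static bool flush_and_close(int *fd)
{
    if (*fd < 0) {
        return true;            /* nothing open */
    }
    bool ok = (fsync(*fd) == 0);
    if (close(*fd) != 0) {
        ok = false;
    }
    *fd = -1;
    return ok;
}
```

A false return means the byte counter may be ahead of what is really on the medium, which is exactly the situation the defensive reader in the next step has to tolerate.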

3. Assume the file is already damaged. The reader never trusts the bytes on disk. Before each frame it reads the leading length tag and sanity-checks it. A zero length, an impossibly large length, or a frame that would run past the file's captured end means the file is corrupt — almost always a torn final write from a power loss — so we abandon it cleanly instead of crashing.

        /* Decode the little-endian leading tag and reject impossible lengths. */
        uint16_t flen = (uint16_t)tag[0] | ((uint16_t)tag[1] << 8);
        if (flen == 0 || flen > MAX_FRAME_BYTES) {
            ESP_LOGE(TAG, "replay: bogus tag %u at pos %lld; abandoning file",
                     flen, (long long)s_replay_pos);
            close_replay_fd();
            remove(SPILL_PATH);
            apply_mode(compute_mode());
            goto out;
        }

We also capture the file's size once when we open it for reading, and treat any frame that claims to extend beyond that boundary as corrupt. The result: a torn frame is a logged, recoverable event — never a boot loop, never a panic, never a stream of garbage masquerading as the dog's movement.
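
Put together, the per-frame checks might look like the sketch below. It works on an in-memory copy of the file so it can run on a host. frame_check and frame_status_t are hypothetical names, the value given to MAX_FRAME_BYTES is an assumed placeholder, and the trailing-tag comparison is our reading of how the dual tags prove a frame intact rather than a copy of the firmware.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAME_BYTES 512u   /* assumed value; the real limit is not shown */
#define FRAME_OVERHEAD 4u      /* two 2-byte length tags */

typedef enum { FRAME_OK, FRAME_END, FRAME_CORRUPT } frame_status_t;

/* Validate the frame starting at `pos` in a file of `file_size` bytes,
 * where `file_size` is the size captured once at open time. On FRAME_OK,
 * *payload_len receives the payload length. */
static frame_status_t frame_check(const uint8_t *file, size_t file_size,
                                  size_t pos, uint16_t *payload_len)
{
    if (pos == file_size) {
        return FRAME_END;                 /* clean end of file */
    }
    if (pos > file_size || file_size - pos < FRAME_OVERHEAD) {
        return FRAME_CORRUPT;             /* not even room for two tags */
    }
    uint16_t flen = (uint16_t)(file[pos] | ((uint16_t)file[pos + 1] << 8));
    if (flen == 0 || flen > MAX_FRAME_BYTES) {
        return FRAME_CORRUPT;             /* bogus leading tag */
    }
    if (file_size - pos - FRAME_OVERHEAD < flen) {
        return FRAME_CORRUPT;             /* frame runs past captured end */
    }
    size_t tail = pos + 2u + flen;
    uint16_t tlen = (uint16_t)(file[tail] | ((uint16_t)file[tail + 1] << 8));
    if (tlen != flen) {
        return FRAME_CORRUPT;             /* tags disagree: torn frame */
    }
    *payload_len = flen;
    return FRAME_OK;
}
```

A reader loop would advance by payload_len + 4 after each FRAME_OK, stop on FRAME_END, and abandon the file on FRAME_CORRUPT.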

4. Decide what is not worth keeping. This is the design decision I'm proudest of, because it's the one most teams get wrong by reflex. It is tempting to make the spill file durable across reboots — but the frames in it carry timestamps from a previous power session. Replaying day-old motion as if it were live would silently poison the health timeline. So on boot, we wipe it.

    /* Wipe any spill carried over from a prior boot. Frames in there
     * carry stale event_ts values from the previous session; replaying
     * them on next BLE connect would inject obsolete samples into the
     * timeline. The two-tier queue's job is real-time outage resilience,
     * not cross-reboot durability. */
    if (remove(SPILL_PATH) == 0) {
        ESP_LOGI(TAG, "wiped stale spill from prior boot");
    }

Crash-safety is not "never lose a byte." It's "never let a surviving byte lie." A buffer that resurrects stale data after a reset is more dangerous than one that admits the gap.

The Results

In practice this layer does three things reliably. It absorbs link outages up to a hard flash budget, so a dog out of range still has its activity recorded and re-delivered when the hub reconnects. It tolerates power loss at any instant: the worst a dying battery can do is leave one half-written frame, and that frame is either wiped with the rest of the stale file on the next boot or caught by the reader's validation and discarded. And it never stalls the sensors, because the durable write path runs on its own task and the producer never blocks on flash.

Equally important is what it refuses to do. It will not grow unbounded and starve other data. It will not crash on a corrupt file. And it will not replay yesterday's motion into today's record. The behavior is boring on purpose — boring is what you want from the component standing between a flaky power rail and a pet's medical history.

Why It Matters at Hoomanely

Hoomanely is reinventing healthcare for pets — replacing a reactive, imprecise, stressful model of care with continuous, clinical-grade monitoring that catches problems early. Our devices form a Physical Intelligence ecosystem: interconnected sensors fused at the edge, feeding the Biosense AI Engine that turns raw signals into personalized, preventive insights.

That promise rests on a quiet assumption — that the data reaching the AI engine is complete and honest. A daily activity trend is only meaningful if a dropped wireless link didn't quietly erase an afternoon, and only trustworthy if a battery swap didn't replay stale steps into a fresh day. The crash-safe spill buffer is where that integrity is enforced, frame by frame, before any model ever sees the data.

Our guiding principle is that every signal matters and every detail counts. In firmware terms, that means engineering for the millisecond the power dies — and being disciplined about what deserves to survive it. Getting durability right at the edge is what lets the rest of the system make confident claims about a pet's health.

This buffer is the persistence layer inside the collar-worn tracker that anchors Hoomanely's Physical Intelligence ecosystem. The sensors capture, this layer guarantees the data is whole and honest, the hub re-delivers it, and the Biosense AI Engine turns it into preventive insight. It's a small file format and a handful of guards — but it's the difference between a health record you can act on and one you can only hope is right.