Firmware

Motion Wakes the Mic: Fusing IMU and Audio on a Collar

Vaishak C

24 Jun 2026 — 6 min read

A barking dog tells you something. A lunging, barking dog tells you a lot more. On a battery-powered pet collar, the two most informative senses are motion and sound — but you can't run a microphone flat-out all day, and a motion sensor alone can't tell a bark from a yawn. So our tracker treats them as one system: the inertial sensor watches for movement continuously and cheaply, and when something happens, it wakes the microphone to record a short, context-rich clip. The two streams are then stitched back together by a shared timestamp so the backend sees a single event — "the dog moved, and here's exactly what it sounded like." This post walks through how that motion-triggered, preroll-buffered, time-correlated audio capture actually works in firmware.

The Problem: Two Senses, One Tiny Power Budget

A collar's inertial measurement unit (IMU) is happy to run all day — it's low-power and its motion data feeds the pedometer regardless. A microphone is a different beast. Recording, compressing, storing, and shipping 48 kHz audio is expensive in CPU, flash, and radio time, so it can't be always-on.

The naive fixes both fail. Record continuously and you flatten the battery and drown the wireless link. Record only on a fixed timer and you miss the moments that matter while capturing hours of silence.

There's a subtler trap too. Even if you trigger audio on an event, a microphone that starts recording when it detects the event has already missed the event's beginning — the sharp onset of a bark is over in tens of milliseconds, long before any trigger fires. And once you have a clip and a motion record, you still have to prove they describe the same moment. Getting all of this right is what turns two raw sensors into one behavioral signal.

The Approach: Let Motion Be the Trigger

The design principle is simple: the cheap sensor gates the expensive one. The IMU runs continuously and acts as the always-on sentinel; the microphone stays idle until motion arms it. That keeps average power low while ensuring the mic is recording whenever there's something worth hearing.

To make that robust, the motion detector can't just look at raw acceleration magnitude — a collar at rest still reads ~1 g of gravity, and a tilt isn't motion. Instead we track the peak-to-peak swing of the acceleration magnitude over a short rolling window, which captures dynamic movement while ignoring static orientation. A glitch guard then requires the threshold to hold across consecutive reads, so a single noisy IMU sample can't fire the mic on its own:

extern "C" bool accelorometer_is_moving(void)
{
    float p2p = accelorometer_motion_p2p();
    if (p2p < 0.0f) return false;
    /* Glitch guard: require the threshold to hold for >=2 consecutive reads so a
     * single duplicate/garbage IMU sample (the stream has occasional ones) can't
     * arm a clip on its own. is_moving() is polled ~once per audio block. */
    static int over = 0;
    if (p2p > ACCEL_MOTION_P2P_G) { if (over < 2) over++; }
    else                         { over = 0; }
    return over >= 2;
}
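The excerpt calls accelorometer_motion_p2p() without showing it. As a rough illustration only, here is one way a peak-to-peak measure over a rolling window could be computed. The type p2p_window_t, the functions p2p_push and p2p_value, and the window length P2P_WINDOW are all hypothetical names for this sketch, not the firmware's own. The caller is assumed to pass in the acceleration magnitude (the square root of ax² + ay² + az²) in g.

```c
#include <assert.h>
#include <stddef.h>

#define P2P_WINDOW 16  /* hypothetical window length, in samples */

typedef struct {
    float  mag[P2P_WINDOW]; /* recent acceleration magnitudes, in g */
    size_t head;            /* next slot to overwrite */
    size_t count;           /* valid samples, saturates at P2P_WINDOW */
} p2p_window_t;

/* Push one acceleration magnitude (in g) into the rolling window. */
static void p2p_push(p2p_window_t *w, float mag_g)
{
    assert(w != NULL);
    w->mag[w->head] = mag_g;
    w->head = (w->head + 1) % P2P_WINDOW;
    if (w->count < P2P_WINDOW) w->count++;
}

/* Peak-to-peak swing of the magnitude over the window, or -1.0f when
 * fewer than two samples are available (the excerpt above treats a
 * negative value as "no data"). */
static float p2p_value(const p2p_window_t *w)
{
    assert(w != NULL);
    if (w->count < 2) return -1.0f;
    float lo = w->mag[0], hi = w->mag[0];
    for (size_t i = 1; i < w->count; ++i) {
        if (w->mag[i] < lo) lo = w->mag[i];
        if (w->mag[i] > hi) hi = w->mag[i];
    }
    return hi - lo;
}
```

A collar lying still at any tilt feeds in a steady magnitude near 1 g, so the swing stays near zero. Only a change in magnitude within the window raises it.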

This single boolean — "is the dog moving?" — is the bridge between the two subsystems. The audio capture loop polls it roughly once per encoded audio block and arms a clip the moment it goes true (or when the hub forces a one-shot capture for a labeled dataset).

The Process: Preroll, Arm, and Correlate

Catching the onset with a preroll ring. To solve the "the bark is over before we trigger" problem, the encoder is always writing its most recent compressed audio blocks into a small circular look-back buffer, even while idle. When motion finally arms a clip, we flush that ring into the front of the recording — so the clip begins a couple of seconds before the trigger:

static int preroll_flush_to_clip(void)
{
    if (!s_preroll) return 0;
    int oldest = (s_preroll_head - s_preroll_count + PREROLL_BLOCKS) % PREROLL_BLOCKS;
    for (int i = 0; i < s_preroll_count; ++i) {
        clip_write_block(s_preroll[(oldest + i) % PREROLL_BLOCKS]);
    }
    return s_preroll_count;
}

The cost is tiny — a few kilobytes of heap holding pre-compressed blocks — and the payoff is that the onset that triggered capture sits at the head of the clip, not lost to trigger latency.
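The flush above is only half of the ring. The write side is not shown, so the following is a sketch of how it might look. It assumes fixed-size blocks held in a static array, whereas the firmware evidently uses a heap allocation. The names s_ring, s_head, s_count, preroll_push and preroll_index, and the sizes, are illustrative. The oldest-index arithmetic matches the flush routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PREROLL_BLOCKS 4  /* hypothetical ring depth */
#define BLOCK_BYTES    8  /* hypothetical compressed block size */

static uint8_t s_ring[PREROLL_BLOCKS][BLOCK_BYTES];
static int s_head  = 0;   /* next slot to overwrite */
static int s_count = 0;   /* valid blocks, saturates at PREROLL_BLOCKS */

/* Store the newest compressed block, overwriting the oldest when full. */
static void preroll_push(const uint8_t *block)
{
    assert(block != NULL);
    memcpy(s_ring[s_head], block, BLOCK_BYTES);
    s_head = (s_head + 1) % PREROLL_BLOCKS;
    if (s_count < PREROLL_BLOCKS) s_count++;
}

/* Ring slot of the i-th oldest buffered block (0 = oldest). */
static int preroll_index(int i)
{
    assert(i >= 0 && i < s_count);
    int oldest = (s_head - s_count + PREROLL_BLOCKS) % PREROLL_BLOCKS;
    return (oldest + i) % PREROLL_BLOCKS;
}
```

Because the head always points at the next slot to overwrite, subtracting the count and wrapping gives the oldest surviving block. That is why a flush replays audio in chronological order.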

Arming and the shared correlation key. When motion (or a forced command) fires, the capture loop opens a clip and, crucially, stamps a capture id (cid) derived from the wall-clock millisecond timestamp — the same clock the IMU stamps onto its own sample frames. It then tells the sensor-frame writer to tag every concurrent IMU frame with that id:

ESP_LOGI(TAG, "clip armed by motion (%.2f g) cid=%llu",
         motion_g, (unsigned long long)clip_id);
/* Bound the IMU tagging window to the requested duration: clear
 * the capture_id at arm + N s even if clip_close is delayed by
 * transfer contention, so the IMU tagged with this cid spans
 * exactly the N-second interval, not the record+transfer window. */
cid_deadline_ms = ts_ms + (uint64_t)(target_blocks / AUDIO_BLOCKS_PER_SEC) * 1000ull;
clip_arm_ms = ts_ms;
file_operations_set_capture_id(clip_id);  /* tag concurrent IMU frames */

From that moment, the IMU producer writes the cid into every motion frame it emits, so the backend can join the audio clip with the exact window of motion that accompanied it:

/* Map has 4 entries normally, 5 when a capture id is active. */
mpack_start_map(&w, cap_id ? 5 : 4);
mpack_write_cstr(&w, "p");
mpack_write_cstr_or_nil(&w, pckt_id);
mpack_write_cstr(&w, "n");
mpack_write_u32(&w, this_seq);
if (cap_id) {
    mpack_write_cstr(&w, "cid");
    mpack_write_u64(&w, cap_id);
}
/* ... the two remaining map entries are omitted from this excerpt ... */
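On the receiving side, the join reduces to selecting the frames whose cid equals the clip's id. The backend is not described in this post, so the sketch below is purely illustrative, and is written in C only to match the firmware excerpts. The struct imu_frame_t and the function frames_for_clip are hypothetical. A frame decoded without a "cid" key is assumed to carry 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t seq;  /* the "n" field of a decoded frame */
    uint64_t cid;  /* the "cid" field, or 0 if the key was absent */
} imu_frame_t;

/* Collect the sequence numbers of frames tagged with clip_cid.
 * Returns the number of matches; at most out_cap are stored. */
static size_t frames_for_clip(const imu_frame_t *frames, size_t n,
                              uint64_t clip_cid,
                              uint32_t *out_seq, size_t out_cap)
{
    assert(frames != NULL && out_seq != NULL);
    if (clip_cid == 0) return 0;  /* 0 means "untagged", never a real clip */
    size_t matched = 0;
    for (size_t i = 0; i < n; ++i) {
        if (frames[i].cid != clip_cid) continue;
        if (matched < out_cap) out_seq[matched] = frames[i].seq;
        matched++;
    }
    return matched;
}
```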

Notice the discipline around the tagging window. The cid is cleared after exactly the clip's intended duration, even if shipping the clip over the radio is delayed — so the motion tagged with a given clip spans precisely the recorded interval, never the longer record-plus-transfer window. That keeps the fused event clean: the IMU samples sharing a clip's cid are only the ones that happened during the audio.

The Results

The combined behavior is a collar that hears the right things at the right times. Average power stays low because the mic sleeps through stillness; clips are dense with signal because they only open on movement; and each clip arrives with its bark onset intact and a window of synchronized motion attached. A still-but-vocal dog is the one tricky case, which is why the trigger is structured as a hybrid: motion is the primary gate, and the architecture leaves room for arming on an acoustic onset as well.

Just as important, the fusion costs almost nothing. The bridge between subsystems is a single boolean and a 64-bit id; there's no shared buffer to lock, no second clock to reconcile, and no heavyweight message bus. The audio loop polls a flag, the IMU writer stamps an integer, and the join happens later on a far larger machine. That's the kind of cheap, robust coupling a milliwatt budget demands.

Why It Matters at Hoomanely

Hoomanely is reinventing healthcare for pets — replacing reactive, imprecise care with continuous, clinical-grade monitoring that catches problems early. Our devices form a Physical Intelligence ecosystem: sensors fused at the edge, feeding the Biosense AI Engine that turns raw signals into personalized, preventive insights.

Behavior is where many health issues first surface, and behavior is inherently multimodal. Separation anxiety looks like restless motion and sounds like whining; pain can show up as a change in gait and a change in vocalization. A motion stream and an audio stream analyzed in isolation each tell half the story; correlated by a shared cid, they let the Biosense engine reason about a complete event — what the dog did and what it sounded like, in the same moment.

Our guiding principle is that every signal matters and every detail counts. Waking the microphone with the IMU, capturing the onset with a preroll, and binding the two streams with one timestamp is how we honor that — turning two cheap sensors into a behavioral signal a clinician can trust.

Key Takeaways

Let the cheap sensor gate the expensive one. An always-on IMU as a motion sentinel keeps the power-hungry microphone idle until there's something worth recording.
Detect motion, not tilt. Triggering on peak-to-peak swing over a short window — with a consecutive-read glitch guard — separates real movement from gravity and sensor noise.
Preroll captures the onset. Continuously buffering recent compressed audio and flushing it on trigger puts the bark's beginning at the head of the clip instead of losing it to latency.
Correlate with a shared key. Stamping audio clips and IMU frames with the same timestamp-derived cid lets the backend join the two streams into one behavioral event.
Bound the tagging window. Clearing the correlation id once the clip's requested duration has elapsed keeps the fused motion window precise, independent of transfer delays.

Author's Note

This motion-gated audio pipeline runs on the collar-worn tracker that anchors Hoomanely's Physical Intelligence ecosystem. The IMU keeps watch, the microphone captures the moment with its onset intact, and a single shared id ties them together for the Biosense AI Engine. It's a small amount of firmware glue — a boolean and an integer — doing the quiet work of turning motion and sound into one trustworthy picture of how a pet is really doing.