Is That Our Dog? Using IMU to Verify What the Collar Microphone Hears

Sri Manika Makam

12 Jun 2026

A collar microphone can't tell whose bark it just heard. It picks up every sound within range: the dog wearing it, the beagle three doors down, the TV in the background. An audio classifier running on the firmware can label a sound as "bark," but it has no way of knowing whether the collar-wearer was actually the one barking. That ambiguity is exactly the problem we set out to fix.

This post walks through the rule-based approach we built to fuse IMU motion data with the on-device audio classifier's output, and what we learned running it against real collar data.

The core insight

A bark isn't just a sound; it's a whole-body event. When a dog barks, the muscles in its neck and throat fire in a rapid, high-energy snap, a directional head jerk that shows up unmistakably in a three-axis accelerometer on the collar. That jerk signal is orientation-invariant: it doesn't matter which way the dog is facing or how the collar happens to be mounted. The rate of change of acceleration, computed across all three axes at once, spikes hard during a real bark and stays quiet during rest and gentle movement.

The audio classifier on the firmware detects sound. The IMU tells us whether the body moved the way a bark would move it. Together they answer the question audio alone can't: whose bark was it?

Pipeline overview

Raw IMU data arrives from the firmware as JSON event records at roughly 100 Hz. We process it through six steps: parse and dedupe the JSON, split into contiguous epochs and resample to 50 Hz, compute the jerk_3d features, apply the rule-based classifier to assign bark, moving, or rest labels, correlate that with audio capture IDs, and generate a fusion plot for every audio clip.

Pipeline flow diagram from raw IMU JSON through resampling, jerk feature computation, classification, and fusion with audio
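
The first two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the record fields (ts_ms, ax, ay, az), the 100 ms gap that starts a new epoch, and the use of linear interpolation for resampling are all hypothetical choices made for the example.

```python
import json

GAP_MS = 100     # assumption: a timestamp gap above this starts a new epoch
TARGET_HZ = 50   # resample rate named in the post

def parse_and_dedupe(lines):
    """Parse JSON records, keep the first record per timestamp, sort by time."""
    by_ts = {}
    for line in lines:
        rec = json.loads(line)  # field names below are hypothetical
        by_ts.setdefault(rec["ts_ms"], (rec["ax"], rec["ay"], rec["az"]))
    return sorted(by_ts.items())

def split_epochs(records, gap_ms=GAP_MS):
    """Group (ts, xyz) records into runs with no gap larger than gap_ms."""
    epochs = []
    for rec in records:
        if epochs and rec[0] - epochs[-1][-1][0] <= gap_ms:
            epochs[-1].append(rec)
        else:
            epochs.append([rec])
    return epochs

def resample(epoch, hz=TARGET_HZ):
    """Linearly interpolate one epoch onto a uniform time grid."""
    if len(epoch) == 1:
        return [epoch[0][1]]
    step_ms = 1000.0 / hz
    t0, t_end = epoch[0][0], epoch[-1][0]
    out, i, k = [], 0, 0
    while t0 + k * step_ms <= t_end:
        t = t0 + k * step_ms
        while epoch[i + 1][0] < t:   # advance to the segment containing t
            i += 1
        (ta, a), (tb, b) = epoch[i], epoch[i + 1]
        w = (t - ta) / (tb - ta)
        out.append(tuple(p + w * (q - p) for p, q in zip(a, b)))
        k += 1
    return out

# Synthetic 100 Hz input with one duplicate record and one long gap.
raw = [json.dumps({"ts_ms": t, "ax": t / 1000, "ay": 0.0, "az": 1.0})
       for t in (0, 10, 10, 20, 30, 40, 1000, 1010, 1020)]
epochs = split_epochs(parse_and_dedupe(raw))
resampled = [resample(e) for e in epochs]
print([len(e) for e in epochs], [len(r) for r in resampled])  # [5, 3] [3, 2]
```

Splitting before resampling keeps the interpolation from inventing samples across a gap in the recording, which would otherwise show up as artificial jerk.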

The rule-based classifier

The classifier runs on a single feature: jerk_3d, the 3D rate of change of acceleration across all three axes. We use it instead of a scalar jerk computed from acceleration magnitude because the scalar measure misses directional head snaps, where individual axes spike sharply but the overall magnitude barely changes. The 3D formulation catches these regardless of which direction the neck moves or how the collar is mounted.
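
To make the difference concrete, here is a small sketch that compares the two measures on a synthetic head snap. The exact formula is an assumption: the post does not spell it out, so jerk_3d is taken here as the Euclidean norm of the per-axis acceleration differences multiplied by the sample rate, and both function names are illustrative.

```python
import math

def jerk_3d(samples, fs_hz):
    """Per-sample 3D jerk in g/s (assumed definition): norm of the
    per-axis acceleration differences, scaled by the sample rate."""
    out = []
    for (x0, y0, z0), (x1, y1, z1) in zip(samples, samples[1:]):
        diff = math.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2)
        out.append(diff * fs_hz)
    return out

def jerk_scalar(samples, fs_hz):
    """Jerk computed from acceleration magnitude only, for comparison."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    return [abs(b - a) * fs_hz for a, b in zip(mags, mags[1:])]

# A 1 g vector that swings from the x axis to the y axis in one sample:
# the magnitude is unchanged, but the direction changes sharply.
snap = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(jerk_3d(snap, fs_hz=50))      # about 70.7 g/s
print(jerk_scalar(snap, fs_hz=50))  # [0.0]
```

The scalar version reports no jerk at all for this snap because the magnitude stays at 1 g, while the 3D version registers it clearly.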

The three-class decision comes down to two thresholds applied in order:

def classify_activity(j3d_max, j3d_mean):
    """Label one window from its peak and mean jerk_3d, both in g/s."""
    if j3d_max >= 6.5:    # bark: sharp jerk peak
        return "bark"
    if j3d_mean < 0.20:   # rest: near-zero sustained jerk
        return "rest"
    return "moving"       # everything in between

A bark is triggered by a single sample reaching 6.5 g/s, the signature of the rapid head snap that accompanies a vocalization. Because this check runs first, one such peak labels the window as bark even when the mean jerk is low. Rest requires the mean jerk to stay below 0.20 g/s. Using the mean rather than the peak is deliberate: even a resting dog shifts occasionally, but those peaks stay brief and low. Everything else (walking, trotting, shaking, scratching) falls under moving.

We validated the 6.5 g/s threshold on labeled sessions: it sits well above the walking range (1 to 4 g/s peak) and well below the full-bark range (10 to 90-plus g/s). We evaluated the classifier at two granularities: an epoch view (one label per continuous IMU segment) and a sliding-window view (2-second windows with 1-second overlap), which gives us a per-second activity timeline.
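
The sliding-window view can be sketched as follows. This is an illustration rather than the pipeline's code: it assumes the jerk series is already at 50 Hz, so a 2-second window is 100 samples and a 1-second hop is 50 samples. The sliding_labels name and the synthetic data are made up for the example, and the classifier is repeated so the block runs on its own.

```python
def classify_activity(j3d_max, j3d_mean):
    """Two-threshold rule from the post, thresholds in g/s."""
    if j3d_max >= 6.5:
        return "bark"
    if j3d_mean < 0.20:
        return "rest"
    return "moving"

def sliding_labels(jerk, fs_hz=50, window_s=2, hop_s=1):
    """One label per window; consecutive windows overlap by window_s - hop_s."""
    win, hop = window_s * fs_hz, hop_s * fs_hz
    labels = []
    for start in range(0, len(jerk) - win + 1, hop):
        chunk = jerk[start:start + win]
        labels.append(classify_activity(max(chunk), sum(chunk) / len(chunk)))
    return labels

# Synthetic 5 s of jerk at 50 Hz: 2 s quiet, 1 s of moderate motion,
# then 2 s quiet with a single 30 g/s spike at t = 3.2 s.
jerk = [0.05] * 100 + [2.0] * 50 + [0.05] * 100
jerk[160] = 30.0
print(sliding_labels(jerk))  # ['rest', 'moving', 'bark', 'bark']
```

The single spike labels both windows that contain it as bark, which is the overlap doing its job: each event is seen by two consecutive windows.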

Results on actual data

We ran the analysis on a 4.5-hour session from one of our collar devices, which logged continuous IMU at 100 Hz and triggered 16 audio captures, each tagged with a capture ID linking the audio classifier's label to the concurrent IMU window.

Activity distribution

Across 6,951 sliding windows, the session broke down as 4,433 rest (64%), 1,238 moving (18%), and 1,280 bark (18%). The dog stayed quiet for most of the session: the large rest block ran roughly from 11:47 through mid-afternoon, with bark activity clustering in the 11:00-11:45 and 15:00-15:30 windows.

jerk_3d separation by class

The three classes separate cleanly in jerk space. Rest epochs cluster near zero (jerk_3d_max below 0.5 g/s). Moving epochs span from roughly 1 g/s up to the 6.5 g/s bark threshold. Bark epochs sit far above that threshold, with the strongest bouts reaching 88 to 105 g/s at peak. The two decision boundaries show up as clear gaps in the empirical distributions.

jerk_3d distribution histogram — jerk_3d_max distribution by activity class. Bark epochs cluster far above the threshold; rest is near zero.

Fusion results: where IMU and audio agree, and where they don't

The pipeline matched all 16 audio clips to their concurrent IMU windows and compared the result against human-verified ground truth. The audio classifier is still a work in progress, and in this session it returned "eating" for 9 clips, "panting" for 4, and "unclassified" for 3, with no "bark" label at all. The IMU rule tells a different story: it labeled all 6 human-confirmed barks as bark, regardless of what the audio model said.

Chart comparing IMU-detected activity against audio classifier labels across all 16 clips

The IMU rule also produced 6 false positives, clips where IMU said bark but ground truth was moving or rest. These turned out to be cases of vigorous head-shaking or collar movement spiking above the bark threshold, a known limitation of a single-threshold rule that we return to in the Future work section.
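
For readers who prefer the usual detection metrics, the clip counts above translate as follows. The counts come from this session; the only inferred number is the true-negative count, which is what remains of the 16 clips once the 6 confirmed barks and the 6 false positives are accounted for.

```python
TOTAL_CLIPS = 16
TRUE_POS = 6    # confirmed barks that the IMU rule labeled bark
FALSE_POS = 6   # IMU said bark, ground truth was moving or rest
FALSE_NEG = 0   # no confirmed bark was missed by the IMU rule
TRUE_NEG = TOTAL_CLIPS - TRUE_POS - FALSE_POS - FALSE_NEG  # inferred

recall = TRUE_POS / (TRUE_POS + FALSE_NEG)
precision = TRUE_POS / (TRUE_POS + FALSE_POS)
print(f"recall={recall:.2f} precision={precision:.2f} true_neg={TRUE_NEG}")
# recall=1.00 precision=0.50 true_neg=4
```

With only 16 clips from a single session these figures are indicative at best, but they show where the single-threshold rule stands: it misses no barks and over-triggers on other high-energy movement.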

A few individual clips are worth walking through. Per-clip fusion plots show the jerk_3d waveform against the three activity zones, with both the audio label and the IMU rule's label marked in the corner.

cid 5044, a confirmed bark (27 g/s peak): audio labeled it "eating" at 0.72 confidence, but the IMU shows a sustained series of peaks well above threshold, a clear vocal bout the audio classifier missed entirely.

Fusion plot cid 5044 confirmed bark — cid 5044: IMU correctly identifies a bark bout despite audio misclassifying as eating.

cid 5051, a confirmed rest: audio came back unclassified (0.44 confidence), and the IMU shows flat, near-zero jerk: the dog was just lying still. The fusion correctly suppresses this window, avoiding a false notification.

Fusion plot cid 5051 confirmed rest. cid 5051: The IMU shows rest and the audio model returns no confident label, so no notification is warranted.

cid 5066, a false positive (IMU=bark, ground truth=moving): audio said "panting" at 0.68. There's one early spike above threshold, then sustained sub-threshold noise, meaning a single head movement was enough to trip the bark rule.

Fusion plot cid 5066 false positive — cid 5066: A brief collar movement fires the bark threshold. The mean jerk is low, which richer features would use to reject it.

cid 5052, a false positive (IMU=bark, ground truth=rest): a large burst at 3 to 5 seconds (54 g/s peak, 43 consecutive samples) that came from the dog rolling over rather than a sustained bark cadence.

Fusion plot cid 5052 false positive — cid 5052: High-magnitude collar movement during rest fires the bark rule. Burst duration and cadence features would distinguish this.

What the fusion buys you

Even at this early stage, before the audio classifier is fully trained, the IMU rule gives us a reliable independent signal. Its most useful role right now is suppression: when audio fires a capture but the IMU shows rest, the system can confidently filter that event out as environmental noise such as a neighbor's dog, a TV, or other background sound. The IMU caught all 6 confirmed barks in this session, and as the audio classifier improves, the two signals should start reinforcing each other more precisely.
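
A minimal sketch of that suppression logic might look like the following. Only the rest-means-suppress rule comes from the description above; the function name, the returned decision strings, and the handling of the non-rest cases are assumptions made for illustration.

```python
def fuse(audio_label, imu_label):
    """Decide what to do with one audio capture given the IMU label.

    audio_label is carried along for logging only: in this sketch the
    decision depends on the IMU label alone (an assumption, since the
    audio model is still being trained).
    """
    if imu_label == "rest":
        return "suppress"               # wearer was still: treat as ambient sound
    if imu_label == "bark":
        return "wearer_bark_candidate"  # hypothetical name for a likely wearer bark
    return "undetermined"               # moving: IMU alone cannot decide

# Labels taken from two of the clips discussed above.
print(fuse("unclassified", "rest"))  # cid 5051 -> suppress
print(fuse("eating", "bark"))        # cid 5044 -> wearer_bark_candidate
```

Keeping the bark case as a candidate rather than a confirmed bark reflects the false positives seen in this session.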

Future work

The rule-based classifier is a deliberate starting point: interpretable, tunable, and easy to validate without any training data. Its main limitation is that jerk_3d_max on its own can't distinguish a bark from other high-energy collar events like vigorous head-shaking or rolling over. The next step is a small ML classifier, either a Random Forest or a compact 1D CNN, trained on labeled windows using richer features: jerk peakiness (how rhythmic the burst is), spectral energy distribution, and consecutive-sample counts. The infrastructure for feature extraction and model loading is already built into the pipeline, and the classifier slot is designed to accept a trained model once we have enough labeled data. We'll share that work in a follow-up post.
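
As a rough idea of what those richer features could look like, here is a sketch that computes a consecutive-sample count alongside the existing peak and mean. The function names and the synthetic windows are hypothetical, and nothing here is the trained model or its final feature set.

```python
def longest_run_above(jerk, threshold=6.5):
    """Length of the longest run of consecutive samples at or above threshold."""
    best = run = 0
    for value in jerk:
        run = run + 1 if value >= threshold else 0
        best = max(best, run)
    return best

def window_features(jerk, threshold=6.5):
    """Feature dictionary for one window (illustrative field names)."""
    return {
        "j3d_max": max(jerk),
        "j3d_mean": sum(jerk) / len(jerk),
        "longest_run": longest_run_above(jerk, threshold),
    }

# Two synthetic 2 s windows that the max rule would both label bark.
single_spike = [0.1] * 20 + [9.0] + [0.1] * 79
long_burst = [0.1] * 30 + [20.0] * 43 + [0.1] * 27

print(window_features(single_spike)["longest_run"])  # 1
print(window_features(long_burst)["longest_run"])    # 43
```

A run length of 1 with a low mean matches the kind of isolated spike seen in cid 5066. Run length alone would not reject a long burst like the rollover in cid 5052, which is why cadence and spectral features are on the list as well.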