Is That Our Dog? Using IMU to Verify What the Collar Microphone Hears

A collar microphone cannot tell whose bark it heard. It picks up every sound within range — the dog wearing it, the beagle three doors down, the TV in the background. An audio classifier running on the firmware can label a sound as "bark," but it has no way of knowing whether the collar-wearer was the one doing the barking. That ambiguity is precisely the problem we set out to fix.

In this post we walk through the rule-based approach we developed to fuse IMU motion data with the on-device audio classifier output — and share what we learned from running it on real collar data.

The Core Insight

A bark is not just a sound. It is a physical event. When a dog barks, the muscles in the neck and throat fire in a rapid, high-energy snap, a directional head jerk that shows up clearly in a three-axis accelerometer on the collar. Measured as the rate of change of acceleration across all three axes at once, that jerk is orientation-invariant: it does not matter which way the dog is facing or how the collar is mounted. This 3D jerk spikes hard during a real bark and stays quiet during rest and gentle movement.

The audio classifier on the firmware detects sound. The IMU tells us whether the body moved like a bark. Together they answer the question audio alone cannot: whose bark was it?

Pipeline Overview

Raw IMU data arrives from the firmware as JSON event records at roughly 100 Hz. We built the pipeline to process it in six steps:

1. Parse the JSON and remove duplicate records.
2. Split the stream into contiguous epochs and resample each epoch to 50 Hz.
3. Compute jerk_3d features.
4. Apply the rule-based classifier to assign bark, moving, or rest labels.
5. Correlate the labels with audio capture IDs.
6. Generate a fusion plot for every audio clip.
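The first two steps can be sketched in a few lines of Python with NumPy. This is a minimal sketch, not the pipeline's actual code. It assumes each JSON record carries hypothetical fields `ts` (seconds), `ax`, `ay`, `az` (in g), treats a repeated timestamp as a duplicate, starts a new epoch at any gap longer than a hypothetical 0.1 s, and resamples with linear interpolation.

```python
import json

import numpy as np

GAP_S = 0.1       # hypothetical: a timestamp gap longer than this starts a new epoch
TARGET_HZ = 50.0  # resampling rate used by the pipeline


def parse_and_dedup(lines):
    """Parse JSON lines into a time-sorted (N, 4) array of [t, ax, ay, az].

    Field names ("ts", "ax", "ay", "az") are hypothetical. A record whose
    timestamp was already seen is treated as a duplicate and dropped.
    """
    seen = set()
    rows = []
    for line in lines:
        rec = json.loads(line)
        if rec["ts"] in seen:
            continue
        seen.add(rec["ts"])
        rows.append([rec["ts"], rec["ax"], rec["ay"], rec["az"]])
    arr = np.array(rows, dtype=float).reshape(-1, 4)
    return arr[np.argsort(arr[:, 0], kind="stable")]


def split_epochs(samples, gap_s=GAP_S):
    """Split the samples wherever consecutive timestamps jump by more than gap_s."""
    if len(samples) == 0:
        return []
    breaks = np.where(np.diff(samples[:, 0]) > gap_s)[0] + 1
    return np.split(samples, breaks)


def resample(epoch, hz=TARGET_HZ):
    """Linearly interpolate one epoch onto a uniform time grid at `hz`."""
    t = epoch[:, 0]
    grid = np.arange(t[0], t[-1] + 1e-9, 1.0 / hz)
    acc = np.column_stack([np.interp(grid, t, epoch[:, k]) for k in (1, 2, 3)])
    return grid, acc
```

Chained together, `split_epochs(parse_and_dedup(lines))` yields the contiguous epochs, and each one is passed to `resample`. Resampling per epoch, rather than across the whole stream, avoids interpolating across gaps where the collar logged nothing.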

The Rule-Based Classifier

The classifier uses a single feature: jerk_3d, defined as the 3D rate of change of acceleration across all three axes. Because it is computed on all three axes together, it is orientation-invariant. It also catches what a simpler feature misses: a scalar jerk computed from the acceleration magnitude overlooks directional head snaps, where individual axes spike but the magnitude may barely change. The 3D formulation catches them regardless of which direction the neck moves.
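To make the orientation argument concrete, here is one plausible way to compute jerk_3d. It assumes jerk_3d is the Euclidean norm of the per-axis first difference scaled by the sample rate; the pipeline's exact definition (for example, any smoothing) may differ. The second function computes jerk from the acceleration magnitude alone, for comparison.

```python
import numpy as np


def jerk_3d(acc, hz=50.0):
    """Per-sample 3D jerk magnitude in g/s (assumed definition).

    acc: (N, 3) array of accelerations in g, uniformly sampled at `hz`.
    """
    per_axis = np.diff(acc, axis=0) * hz      # (N-1, 3) jerk on each axis
    return np.linalg.norm(per_axis, axis=1)   # norm is unchanged by rotating the axes


def magnitude_jerk(acc, hz=50.0):
    """Jerk of |a| only: blind to direction changes at constant magnitude."""
    return np.abs(np.diff(np.linalg.norm(acc, axis=1))) * hz


# A toy "head snap": the 1 g vector swings from the z axis to the y axis in one sample.
snap = np.array([[0.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
print(jerk_3d(snap))         # about 70.7 g/s (50 * sqrt(2))
print(magnitude_jerk(snap))  # 0.0: the magnitude never changed
```

The toy snap is deliberately extreme, but it shows the failure mode described above: the magnitude-based feature reports no jerk at all, while jerk_3d reports a large one. Because a rotation matrix preserves vector length, rotating every sample (a differently mounted collar) leaves jerk_3d unchanged.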

The three-class decision is two thresholds applied in order:

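```python
def classify_activity(j3d_max, j3d_mean):
    """Label one epoch or window from its jerk_3d max and mean (both in g/s)."""
    if j3d_max >= 6.5:    # bark: one sharp jerk peak is enough
        return "bark"
    if j3d_mean < 0.20:   # rest: near-zero sustained jerk
        return "rest"
    return "moving"       # everything in between
```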

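Bark is triggered when a single sample crosses 6.5 g/s, the rapid head snap that accompanies a vocalization. Because bark is checked first, one such peak is enough to label the window bark even when its mean jerk is low. Rest requires the mean jerk to stay below 0.20 g/s. Using the mean rather than the max is deliberate: even a resting dog occasionally shifts, but those peaks are brief and low, so they barely move the mean. Everything else is moving: walking, trotting, shaking, or scratching.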

The 6.5 g/s threshold was validated on labelled sessions. It sits above the walking range (1–4 g/s peak) and below the full-bark range (10–90+ g/s). We evaluated the classifier at two granularities: an epoch view (one label per continuous IMU segment) and a sliding-window view (2-second windows advanced in 1-second steps, so consecutive windows overlap by 1 second) that gives a per-second activity timeline.
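The sliding-window view can be sketched as follows. It reuses the two thresholds from `classify_activity` and assumes jerk_3d has already been computed at 50 Hz. Dropping a trailing partial window is our assumption; the pipeline may pad or keep it instead.

```python
import numpy as np

BARK_MAX = 6.5    # g/s, bark threshold on the window maximum
REST_MEAN = 0.20  # g/s, rest threshold on the window mean


def classify_activity(j3d_max, j3d_mean):
    if j3d_max >= BARK_MAX:
        return "bark"
    if j3d_mean < REST_MEAN:
        return "rest"
    return "moving"


def sliding_labels(j3d, hz=50, win_s=2.0, hop_s=1.0):
    """Label each 2 s window (1 s hop) of a jerk_3d trace.

    Returns a list of (window_start_seconds, label). Windows that would run
    past the end of the trace are skipped (an assumption).
    """
    win, hop = int(win_s * hz), int(hop_s * hz)
    out = []
    for start in range(0, len(j3d) - win + 1, hop):
        w = j3d[start:start + win]
        out.append((start / hz, classify_activity(w.max(), w.mean())))
    return out


# 4 s of near-still data with one sharp spike at t = 3.4 s.
trace = np.zeros(200)
trace[170] = 20.0
print(sliding_labels(trace))  # [(0.0, 'rest'), (1.0, 'rest'), (2.0, 'bark')]
```

With a 1-second hop, every second of data falls into two windows, which is what gives the per-second timeline its resolution.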

Results on Actual Data

We ran the analysis on a 4.5-hour session from one of our collar devices. The device logged continuous IMU at 100 Hz and triggered 16 audio captures, each tagged with a capture ID that links the audio classifier label to the concurrent IMU window.
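Correlating a capture with its concurrent IMU window amounts to slicing the jerk trace by the capture's time span. A minimal sketch, assuming each capture record exposes hypothetical `cid`, `start`, `end` (seconds, on the same clock as the IMU timestamps) and `audio_label` fields:

```python
import numpy as np


def classify_activity(j3d_max, j3d_mean):
    if j3d_max >= 6.5:
        return "bark"
    if j3d_mean < 0.20:
        return "rest"
    return "moving"


def match_captures(captures, t, j3d):
    """Attach an IMU rule label to each audio capture.

    captures: list of dicts with hypothetical keys "cid", "start", "end",
              and "audio_label".
    t, j3d:   timestamps (s) and jerk_3d values (g/s) for resampled IMU data.
    """
    rows = []
    for cap in captures:
        mask = (t >= cap["start"]) & (t <= cap["end"])
        if not mask.any():
            imu = None  # no concurrent IMU data, e.g. the capture fell in a gap
        else:
            w = j3d[mask]
            imu = classify_activity(w.max(), w.mean())
        rows.append({"cid": cap["cid"], "audio": cap["audio_label"], "imu": imu})
    return rows
```

Returning `None` for a capture with no overlapping IMU samples keeps the gap visible downstream instead of silently labelling it rest.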

Activity Distribution

Across 6,951 sliding windows, the session broke down as 4,433 rest (64%), 1,238 moving (18%), and 1,280 bark (18%). The dog was quiet for most of the session: the large rest block runs from roughly 11:47 to mid-afternoon, with bark activity clustering in the 11:00–11:45 and 15:00–15:30 windows.

Activity distribution across 6,951 two-second windows in a 4.5-hour session.

jerk_3d Separation by Class

The three classes separate cleanly in jerk space. Rest epochs cluster near zero (jerk_3d_max < 0.5 g/s). Moving epochs span roughly 1–7 g/s. Bark epochs extend far above the 6.5 g/s threshold, with the strongest bouts reaching 88–105 g/s peak. The two decision boundaries are visible as clear gaps in the empirical distributions.

jerk_3d_max distribution by activity class. Bark epochs cluster far above the threshold; rest is near zero.

Fusion Results: Where IMU and Audio Agree (and Disagree)

The pipeline matched 16 audio clips to their concurrent IMU windows and compared against human-verified ground truth. The audio classifier is still a work in progress — in this session it returned "eating" for 9 clips, "panting" for 4, and "unclassified" for 3, with no "bark" labels at all. The chart below shows what the IMU rule detected for each audio label category: all 6 confirmed barks were correctly identified as IMU=bark, regardless of what the audio model said.

The 6 IMU false positives (clips where IMU=bark but ground truth is moving or rest) are cases of vigorous head-shaking or collar movement that spike above the bark threshold — a known limitation of the single-threshold rule discussed in Future Work below.
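The counts above are enough to summarize the IMU rule as a bark detector for this session. Of the 16 clips, 6 are confirmed barks (all IMU=bark) and 6 are false positives, which leaves 4 clips where both IMU and ground truth are non-bark. With so few clips, these rates are rough indicators rather than benchmarks:

```python
def rates(tp, fp, fn, tn):
    """Precision, recall and specificity of the IMU bark rule."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return precision, recall, specificity


# Counts from this session's 16 matched clips.
tp, fp = 6, 6           # IMU=bark on a confirmed bark / on moving or rest
fn = 0                  # no confirmed bark was missed
tn = 16 - tp - fp - fn  # remaining clips: IMU and ground truth both non-bark
print(rates(tp, fp, fn, tn))  # (0.5, 1.0, 0.4)
```

Perfect recall with 50% precision is the profile of a detector that is safe for suppression but not yet for notification on its own.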

Per-clip fusion plots show the jerk_3d waveform against the three activity zones, with both the audio label and IMU rule label in the corner.

cid 5044 — confirmed bark (jerk 27 g/s peak). Audio said "eating" at 0.72 confidence. IMU shows a sustained series of peaks well above threshold — a clear vocal bout the audio classifier completely missed.

cid 5044: IMU correctly identifies a bark bout despite audio misclassifying as eating.

cid 5051 — confirmed rest. Audio was unclassified (0.44 confidence). IMU shows flat near-zero jerk — the dog was lying still. The fusion correctly suppresses this, preventing a false notification.

cid 5051: IMU identifies a rest window, so the unclassified audio capture is suppressed and no notification is warranted.

cid 5066 — false positive (IMU=bark, GT=moving). Audio said "panting" (0.68). One early spike above threshold then sustained sub-threshold noise — a single head movement triggered the bark rule.

cid 5066: A brief collar movement fires the bark threshold. The mean jerk is low, which richer features would use to reject it.

cid 5052 — false positive (IMU=bark, GT=rest). A large burst at t=3–5s (54 g/s peak, 43 consecutive samples) from the dog rolling over rather than a sustained bark cadence.

cid 5052: High-magnitude collar movement during rest fires the bark rule. Burst duration and cadence features would distinguish this.

What the Fusion Buys You

Even at this early stage, before the audio classifier is fully trained, the IMU rule provides a useful independent signal. Its most practically useful role right now is suppression: when audio fires a capture but the IMU shows rest, the system can filter that event as environmental noise (a neighbor's dog, a TV, background sound). In this session the rule missed none of the 6 confirmed barks, so suppressing on IMU rest never discarded a real bark; the cost is the 6 false positives, which still need audio or richer IMU features to reject. As the audio classifier improves, the two signals should reinforce each other more precisely.
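The suppression rule can be written as a small decision function. Only the rest branch reflects the behavior described above; the other branches and all return values are hypothetical placeholders for how the audio label might be folded in once the audio classifier becomes reliable.

```python
def fuse(audio_label, imu_label):
    """Hypothetical per-capture decision combining audio and IMU labels."""
    if imu_label == "rest":
        # Collar-wearer was still: treat the sound as environmental noise.
        return "suppress"
    if imu_label == "bark" and audio_label == "bark":
        # Both signals agree (placeholder: audio never said "bark" in this session).
        return "confirmed_bark"
    if imu_label == "bark":
        # Body moved like a bark, but the rule alone also fires on shakes and rolls.
        return "candidate_bark"
    # Moving: not enough evidence either way from the current rule.
    return "undecided"
```

Keeping "candidate_bark" separate from "confirmed_bark" reflects the 50% precision of the IMU rule on its own: it is strong evidence, not proof.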

Future Work

The rule-based classifier is a deliberate starting point — interpretable, tuneable, and straightforward to validate without any training data. Its main limitation: jerk_3d_max alone cannot distinguish bark from other high-energy collar events like vigorous head-shaking or rolling. The next step is a small ML classifier — Random Forest or compact 1D CNN — trained on labelled windows using richer features: jerk peakiness (rhythmicity of the burst), spectral energy distribution, and consecutive-sample counts. The infrastructure for feature extraction and model loading is already in place in the pipeline; the classifier slot is designed to accept a trained model once we have enough labelled data. We'll share that work in a follow-up post.
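As a rough illustration of the richer features, here is a sketch computing three of them on one jerk_3d window. The consecutive-sample count follows the description above directly. The peak-to-mean ratio is our stand-in for "peakiness" and may not match the planned definition, and the 2–8 Hz band for spectral energy is an arbitrary assumption, not a tuned value.

```python
import numpy as np


def longest_run_above(j3d, thresh=6.5):
    """Longest run of consecutive samples at or above thresh (g/s)."""
    best = run = 0
    for above in j3d >= thresh:
        run = run + 1 if above else 0
        best = max(best, run)
    return best


def peak_to_mean(j3d):
    """Peak jerk over mean jerk; an isolated spike gives a large ratio."""
    m = j3d.mean()
    return float(j3d.max() / m) if m > 0 else float("inf")


def band_energy_fraction(j3d, hz=50.0, lo=2.0, hi=8.0):
    """Share of spectral energy (mean removed) between lo and hi Hz."""
    x = j3d - j3d.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / hz)
    total = power.sum()
    if total == 0:
        return 0.0
    band = (freqs >= lo) & (freqs <= hi)
    return float(power[band].sum() / total)


def window_features(j3d, hz=50.0):
    """Feature vector for one window, as input to a future trained model."""
    return {
        "j3d_max": float(j3d.max()),
        "j3d_mean": float(j3d.mean()),
        "run_above": longest_run_above(j3d),
        "peak_to_mean": peak_to_mean(j3d),
        "band_frac": band_energy_fraction(j3d, hz),
    }
```

On a window like cid 5066, one early spike followed by low sub-threshold jerk, we would expect a short `run_above` and a high `peak_to_mean`, which is the kind of contrast a trained model could learn to use alongside `j3d_max`.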
