Transformer-Guided Noise Cancellation: AudioSep as Teacher, ConvTasNet as Student

Introduction

At Hoomanely, we capture audio only when the proximity sensor detects a dog near the bowl. This ensures the sound we record is directly tied to feeding or drinking activity - not hours of irrelevant background noise.

But even during those few seconds, real homes are chaotic acoustic environments: TVs, human conversation, clanking bowls, noise bleeding in from outside, and room echo. Traditional DSP filters and Mel-band masking help somewhat, but they still leak speech and suppress the subtle transients we care about (chewing spikes, micro-gulps, lapping patterns).

This pushed us toward a more modern and precise approach: a two-stage learning-based noise-cancellation pipeline where a transformer separation model (AudioSep) becomes the teacher, and a lightweight ConvTasNet model becomes the student deployed on-device.

This article walks through how we built it, why AudioSep outperforms handcrafted band-pass filters, and how this improves overall dog-sound understanding at Hoomanely.


1. Why Classical Filtering Was Not Enough

Early attempts relied on a mix of:

  • high-pass filtering around ~100 Hz
  • low-pass filtering around ~5–6 kHz
  • Mel-band suppression (reducing speech-dominant Mel bins)
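
For concreteness, the baseline looked roughly like the sketch below, using scipy.signal. The cutoffs and filter order here are illustrative, not our exact production values:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_baseline(audio: np.ndarray, sr: int = 16000,
                      low_hz: float = 100.0, high_hz: float = 5500.0,
                      order: int = 4) -> np.ndarray:
    """High-pass at ~100 Hz and low-pass at ~5.5 kHz in one band-pass stage."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    # Zero-phase filtering avoids smearing the transients we care about.
    return sosfiltfilt(sos, audio)
```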

These helped reduce steady-state noise but failed with:

  • human speech mixed with dog chewing
  • TV dialogue overlapping mid-bands
  • sudden metal impacts
  • reverberant bowls amplifying noise

More importantly, these DSP methods often removed parts of the dog’s actual eating signal, confusing our downstream classifiers.

We needed a cleaner supervisory signal.


2. Enter AudioSep: Transformer-Based Separation with Text Queries

AudioSep is a state-of-the-art, transformer-based audio separation model that isolates sounds based on a natural-language query.

Instead of telling the model how to filter sound, we simply tell it what we want:

  • “dog chewing kibble”
  • “dog drinking water”
  • “dog eating food”

AudioSep then extracts only the audio components semantically matching that description.
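
In practice, running a query looks roughly like this. The sketch follows the reference implementation in the open-source AudioSep repository; the helper names (`build_audiosep`, `inference`) and the config/checkpoint paths come from its README and may differ between versions:

```python
import torch
from pipeline import build_audiosep, inference  # helpers from the AudioSep repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Paths follow the AudioSep README; adjust to your local checkout.
model = build_audiosep(
    config_yaml="config/audiosep_base.yaml",
    checkpoint_path="checkpoint/audiosep_base_4M_steps.ckpt",
    device=device,
)

# The text query tells AudioSep *what* to keep, not *how* to filter.
inference(model, "bowl_clip.wav", "dog drinking water", "dog_only.wav", device)
```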

Why AudioSep Generates Better Teacher Targets

Compared to DSP or Mel-band masks, AudioSep delivers:

  • Much cleaner separation of dog sounds from background speech
  • Retention of critical transients (tiny crunches, lapping spikes)
  • No reliance on the sources occupying separate frequency bands
  • Stable pseudo-labels across different rooms, mics, and bowls

In short: Mel band-pass cares about frequency. AudioSep cares about meaning.

This makes it the ideal teacher model.



3. Dataset Creation Using AudioSep Pseudo-Labels

To train ConvTasNet, we need paired data:

  • X_noisy: Real bowl audio (with environmental noise)
  • Y_clean: Clean dog-only audio

But we cannot record genuinely clean dog-only signals in a real home, so AudioSep becomes the generator of pseudo-ground truth.

Step-by-step process

  1. Record proximity-triggered audio during eating/drinking.
  2. Run AudioSep with a relevant query (“dog drinking water”).
  3. AudioSep outputs clean dog-only audio → becomes Y_clean.
  4. Y_clean is mixed with various noise samples (speech, fan hum, utensils) to create additional X_noisy variations (sketched in code after this list).
  5. Each (X_noisy, Y_clean) pair becomes a training example.
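
Step 4, the remixing, is simple in code. A minimal sketch using only numpy; the SNR range and noise sources are free parameters:

```python
import numpy as np

def mix_at_snr(y_clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture hits the requested SNR, then add it."""
    noise = noise[: len(y_clean)]            # assumes the noise clip is long enough
    clean_power = np.mean(y_clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return y_clean + scale * noise

rng = np.random.default_rng(0)
# One new X_noisy per noise sample, at a random SNR, e.g. between 0 and 15 dB:
# x_noisy = mix_at_snr(y_clean, speech_clip, rng.uniform(0.0, 15.0))
```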

This gives us a large, high-quality dataset without manually annotating anything.

Why not use Mel-filtered output as ground truth?

  • Mel filtering suppresses dog transients.
  • Leakage from speech is unpredictable.
  • Artifacts confuse the ConvTasNet model.

AudioSep fixes all of this.


4. ConvTasNet: The Lightweight Student for On-Device Inference

Once we have clean pseudo-labels, we train a ConvTasNet, which is:

  • fast
  • small enough for on-device inference
  • excellent at separating non-stationary, transient sounds

Architecture Overview

ConvTasNet includes:

  • 1D convolutional encoder
  • stacks of dilated depthwise conv blocks
  • 1D decoder reconstructing clean waveforms
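
Using torchaudio’s reference implementation, instantiating the student looks like this. The hyperparameters below are the defaults from the original Conv-TasNet paper, not our production configuration; we treat the task as two-source separation (dog vs. everything else) and keep the dog estimate:

```python
import torch
from torchaudio.models import ConvTasNet

model = ConvTasNet(
    num_sources=2,           # source 0: dog sounds, source 1: residual noise
    enc_kernel_size=16,      # 1-D convolutional encoder
    enc_num_feats=512,
    msk_kernel_size=3,       # dilated depthwise conv blocks in the mask network
    msk_num_feats=128,
    msk_num_hidden_feats=512,
    msk_num_layers=8,
    msk_num_stacks=3,
)

x = torch.randn(4, 1, 16000)     # (batch, channel=1, samples): 1 s at 16 kHz
separated = model(x)             # (batch, num_sources, samples)
dog_estimate = separated[:, 0]   # keep the dog source
```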

Why ConvTasNet Works Well Here

Dog chewing and lapping sounds are:

  • transient
  • broadband
  • rhythmic
  • non-stationary

ConvTasNet captures these patterns far better than RNNoise-style generic denoisers.
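
Training follows the standard Conv-TasNet recipe: maximize scale-invariant SNR (SI-SNR) between the student’s dog estimate and the AudioSep pseudo-label. A minimal training step, reusing `model` from the sketch above (the loop is a simplification of our actual setup):

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB, per example, over the last dimension."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get the "true source" component.
    s_target = (estimate * target).sum(-1, keepdim=True) * target \
               / (target.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x_noisy: torch.Tensor, y_clean: torch.Tensor) -> float:
    # x_noisy: (batch, 1, samples); y_clean: (batch, samples)
    dog_estimate = model(x_noisy)[:, 0]
    loss = -si_snr(dog_estimate, y_clean).mean()  # maximize SI-SNR
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```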



5. Why AudioSep Beats Mel Band-Pass Filtering

1. Frequency overlap isn’t a limitation

Speech overlaps heavily with dog chewing on both linear and Mel frequency scales, so no fixed filter can fully separate them.

AudioSep doesn’t care about overlap: it separates on semantic cues rather than spectral position.

2. Transient preservation

Mel suppression often kills the tiniest crunches and licks, and these micro-transients carry exactly the information our downstream models need most.

AudioSep preserves them almost perfectly.

3. Consistency across rooms & bowls

What a Mel filter removes varies with:

  • microphone gain
  • room acoustics
  • background volume

AudioSep yields stable, reproducible outputs.

4. Cleaner supervision → better ConvTasNet

Training on clean AudioSep signals allows ConvTasNet to:

  • learn correct dog-sound patterns
  • suppress irrelevant noise
  • generalize better on-device

This directly improves detection accuracy.


6. Results on Real Devices

After deploying the AudioSep → ConvTasNet pipeline:

  • Speech suppression improved by 10–12 dB.
  • Chewing detection accuracy increased by ~20%.
  • Drinking onset detection became far more stable.
  • Background hum and utensil sounds were heavily reduced.
  • Dog transients were preserved significantly better.
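
For context on how numbers like these can be measured offline: a standard metric is SI-SNR improvement (SI-SNRi), the gain of the enhanced output over the raw input, scored against a reference (here, the AudioSep pseudo-label). A sketch, reusing `si_snr` from the training snippet above:

```python
import torch

def si_snr_improvement(x_noisy: torch.Tensor,
                       y_enhanced: torch.Tensor,
                       y_reference: torch.Tensor) -> torch.Tensor:
    """Mean SI-SNRi in dB across a batch of evaluation clips."""
    # Reuses si_snr() from the training sketch; all tensors: (batch, samples).
    return (si_snr(y_enhanced, y_reference) - si_snr(x_noisy, y_reference)).mean()
```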


7. How This Strengthens Hoomanely’s Mission

Hoomanely aims to understand dogs at a deeper sensory level - what they eat, how fast they drink, how their patterns change over time. Audio quality directly impacts our insights.

This pipeline allows us to:

  • accurately detect eating/drinking events
  • identify anomalies earlier (reduced appetite, gulping irregularities)
  • build reliable health signals
  • run efficient models on edge hardware

Better sound → better insights → healthier pets.


Key Takeaways

  • DSP + Mel filtering wasn’t enough for real home noise.
  • AudioSep acts as the perfect semantic teacher for dog-sound separation.
  • ConvTasNet becomes a lightweight, accurate student for on-device inference.
  • AudioSep-generated pseudo-labels dramatically improve model quality.
  • This pipeline directly enhances Hoomanely’s feeding-related health insights.
