Why Transfer Learning Failed For Our Audio Noise Cancellation Pipeline

Introduction

Transfer learning usually feels like a shortcut that just works. Pretrain on a large dataset, fine-tune on your task, and let scale do the heavy lifting. That approach worked well for us in vision - but audio broke that assumption in a very specific, painful way.

At Hoomanely, we were trying to build a privacy-preserving dog audio pipeline: detect eating sounds, suppress human speech, and later enable anomaly detection — all on edge devices. We relied heavily on a prompt-based audio separation transformer (AudioSep) as a teacher model, assuming its outputs would be “good enough” to fine-tune downstream models.

They weren’t.

This post explains why transfer learning failed in our audio setup, how a seemingly small fraction of noisy teacher outputs (roughly 20%) completely derailed training, and why a smaller, cleaner dataset + diffusion-based augmentation + YAMNet fine-tuning outperformed teacher–student distillation by a wide margin.


Context: Audio at Hoomanely

Hoomanely’s mission is to convert raw pet signals into daily, actionable intelligence - responsibly and privately. Audio is one of the hardest modalities in that equation.

Our constraints shaped everything:

  • Always-on microphones in homes
  • Strict privacy requirements (human speech must not leak)
  • Edge inference (limited compute, low latency)
  • Extremely subtle target signals (chewing, licking, bowl movement)

Unlike vision errors, audio errors are invisible. You don’t see what went wrong — you hear it, and humans are notoriously bad at auditing thousands of clips.


The Original Plan

The original pipeline looked reasonable:

  1. Collect raw bowl audio
  2. Use an audio separation transformer with prompts like
    “dog eating, chewing, drinking”
  3. Treat the separated output as clean ground truth
  4. Fine-tune downstream models (Conv-TasNet / classifiers)
  5. Deploy on edge

This was essentially teacher–student distillation, with the transformer acting as a high-capacity oracle.
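To make the setup concrete, here is a minimal sketch of the distillation step, assuming a PyTorch-style training loop. The `separate_with_prompt` placeholder stands in for the prompt-based separation transformer, and the toy student stands in for Conv-TasNet; neither is our production code.

```python
# Minimal, illustrative sketch of the original teacher-student setup (not production code).
import torch
import torch.nn as nn

def separate_with_prompt(mixture: torch.Tensor, prompt: str) -> torch.Tensor:
    """Placeholder for the prompt-based separation transformer. The real teacher returned
    audio separated according to the text prompt; here we simply pass the mixture through
    so the sketch runs end to end."""
    return mixture

class TinyStudent(nn.Module):
    """Toy 1-D conv student; Conv-TasNet filled this role in the actual pipeline."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=9, padding=4),
        )
    def forward(self, x):
        return self.net(x)

def distillation_step(student, optimizer, mixture_batch):
    # The teacher output becomes the training target, including its subtle artifacts.
    with torch.no_grad():
        pseudo_target = separate_with_prompt(mixture_batch, "dog eating, chewing, drinking")
    estimate = student(mixture_batch)
    loss = nn.functional.l1_loss(estimate, pseudo_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch:
# student = TinyStudent(); opt = torch.optim.Adam(student.parameters(), lr=1e-3)
# loss = distillation_step(student, opt, torch.randn(8, 1, 16000))
```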

On paper, it made sense.


Where Things Started Breaking

The 20% Problem

Roughly 20% of the transformer outputs were bad.

Not catastrophically bad - just subtly wrong:

  • Residual human speech buried under chewing
  • Over-suppressed transients
  • Artificial gating artifacts
  • Frequency smearing that didn’t exist in real data

The issue wasn’t the percentage.

The issue was verification.


Why Audio Is Harder to Verify Than Vision

In vision pipelines:

  • A bad label is visible instantly
  • You can scroll through 500 images in minutes
  • Annotation errors are obvious

In audio:

  • Each clip takes 5–10 seconds
  • Artifacts are subtle
  • Fatigue sets in quickly
  • Humans disagree on what “clean” means

We could not reliably filter that 20% out.

And the model definitely learned from it.


Transformer artifacts often look plausible but distort critical frequency patterns.


Why Transfer Learning Failed (Specifically Here)

1. Teacher Errors Became Student Bias

Teacher–student learning assumes:

“Teacher errors are rare and random.”

Ours were systematic.

The model learned:

  • Transformer artifacts as “normal”
  • Speech leakage as acceptable background
  • Over-smoothed chewing as the target texture

The student didn’t generalize — it overfit the teacher’s mistakes.
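A toy example makes the distinction concrete. With zero-mean random noise on 20% of targets, a least-squares fit still recovers the underlying relationship; with a systematic, input-correlated distortion on the same 20% (the analogue of consistent gating and smearing artifacts), the fit shifts toward the distortion. The numbers below are purely illustrative.

```python
# Toy illustration: zero-mean random label noise averages out; systematic noise does not.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y_true = 2.0 * x  # the "real" relationship the student should learn

idx = rng.choice(len(x), size=len(x) // 5, replace=False)  # the noisy 20%

# Case 1: 20% of targets get zero-mean random noise.
y_random = y_true.copy()
y_random[idx] += rng.normal(0.0, 0.5, size=len(idx))

# Case 2: 20% of targets get a systematic, input-correlated distortion
# (analogous to the teacher consistently over-smoothing certain sounds).
y_system = y_true.copy()
y_system[idx] = 0.5 * x[idx]

fit_slope = lambda y: np.polyfit(x, y, deg=1)[0]
print("true slope: 2.00")
print(f"slope with random noise:     {fit_slope(y_random):.2f}")  # ~2.00
print(f"slope with systematic noise: {fit_slope(y_system):.2f}")  # ~1.70, biased
```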


2. AudioSep Outputs Were Not Auditable at Scale

We couldn’t:

  • Confidently label good vs bad outputs
  • Assign trust scores
  • Weight samples correctly

So every epoch reinforced noise.

In hindsight, this wasn’t a modeling failure - it was a data trust failure.


3. Distillation Multiplied the Damage

Teacher–student setups amplify bias:

  • Teacher mistakes → student representations
  • Student trained longer → stronger bias
  • Downstream tasks inherit both

Each iteration moved us further from real dog audio.


The Pivot: Smaller, Cleaner, Verifiable Data

Instead of asking “How do we fix the model?”, we asked:

“What data can we actually trust?”

The new approach

  • Manually curate a small, clean audio dataset
  • Fewer samples, but 100% verified
  • Real recordings, no separation artifacts

This dataset was tiny compared to our generated corpus — but it was honest.
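Operationally, “trust” reduced to a simple gate: a clip enters training only if a human reviewer has signed it off. A minimal sketch of that gate, assuming a hypothetical CSV manifest whose column names (clip_path, label, verified_by) are illustrative:

```python
# Sketch of the "only verified clips enter training" rule.
# The manifest file and column names are hypothetical.
import csv
from pathlib import Path

def load_verified_clips(manifest_path: str):
    clips = []
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            # A clip is usable only if a named reviewer confirmed it is clean, real audio.
            if row["verified_by"].strip():
                clips.append((Path(row["clip_path"]), row["label"]))
    return clips

# Usage: train_clips = load_verified_clips("bowl_audio_manifest.csv")
```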


Data Strategy Shift

Old strategy → new strategy:

  • Large synthetic dataset → Small real dataset
  • Transformer-generated targets → Human-verified audio
  • Teacher–student learning → Direct supervised learning
  • Hard to audit → Easy to trust

This change alone stabilized training.


Augmenting with Diffusion Models (The Right Way)

The obvious concern was scale.

So instead of trusting separation transformers again, we used audio diffusion models to augment the data, not to generate labels.

Key difference:

  • Conditioned on clean dog audio
  • Used for variation, not truth
  • Human-verified seed data only

Augmentations included:

  • Temporal stretching
  • Subtle pitch variance
  • Environmental noise layering
  • Room impulse responses

Crucially, we could still recognize the sound.
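The diffusion conditioning itself is beyond the scope of this post, but the classical transforms listed above are straightforward to sketch. The snippet below assumes 16 kHz mono clips and uses librosa and scipy; the parameter ranges are illustrative, not our tuned values.

```python
# Sketch of the classical augmentations applied around the diffusion-generated variations.
# Assumes 16 kHz mono float32 audio; ranges are illustrative, not production values.
import numpy as np
import librosa
from scipy.signal import fftconvolve

SR = 16_000

def augment(clip: np.ndarray, noise: np.ndarray, rir: np.ndarray,
            rng: np.random.Generator) -> np.ndarray:
    # Temporal stretching (kept small so chewing cadence stays recognisable).
    clip = librosa.effects.time_stretch(clip, rate=rng.uniform(0.9, 1.1))

    # Subtle pitch variance, well under a semitone either way.
    clip = librosa.effects.pitch_shift(clip, sr=SR, n_steps=rng.uniform(-0.5, 0.5))

    # Environmental noise layering at a randomly chosen SNR (assumes noise >= clip length).
    snr_db = rng.uniform(10, 25)
    noise = noise[: len(clip)]
    gain = np.sqrt(np.mean(clip ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10) + 1e-12))
    clip = clip + gain * noise

    # Room impulse response to simulate different home acoustics.
    clip = fftconvolve(clip, rir)[: len(clip)]

    # Keep the peak level in range.
    return clip / (np.max(np.abs(clip)) + 1e-12)
```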


Diffusion adds variation without inventing structure.


Fine-Tuning YAMNet (Why It Worked)

With clean + augmented data, we fine-tuned YAMNet for dog-specific sounds.

Why YAMNet?

  • Lightweight
  • Strong general audio priors
  • Stable embeddings
  • Designed for environmental audio (not speech separation)

We fine-tuned only the upper layers, keeping lower representations intact.
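A minimal sketch of that transfer step, using the public TF Hub release of YAMNet: the simplest recipe keeps the pretrained network frozen as an embedding extractor and trains a small head on top. (Adapting the upper layers themselves, as we did, requires building YAMNet from its open-source model definition rather than the Hub module; the class list below is hypothetical.)

```python
# Sketch: frozen YAMNet embeddings + a small trainable head for dog-specific sounds.
# Waveforms must be 16 kHz mono float32 in [-1, 1]; class names are illustrative.
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")  # public pretrained model
CLASSES = ["chewing", "licking", "bowl_movement", "background"]  # hypothetical label set

def embed(waveform: tf.Tensor) -> tf.Tensor:
    # YAMNet returns (scores, embeddings, log-mel spectrogram); we keep the 1024-d
    # embeddings and average them over time to get one vector per clip.
    _, embeddings, _ = yamnet(waveform)
    return tf.reduce_mean(embeddings, axis=0)

# Small classification head trained on top of the frozen embeddings.
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])
head.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Usage sketch: X = tf.stack([embed(w) for w in waveforms]); head.fit(X, labels, epochs=20)
```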


Results vs Teacher–Student Distillation

The results were unambiguous:

  • Higher classification accuracy
  • Lower false positives
  • No speech hallucinations
  • More stable embeddings
  • Better edge performance

Most importantly, this setup outperformed the AudioSep-based teacher–student pipeline — despite using far less data.


Clean data + diffusion augmentation beat noisy distillation.


Why This Worked

1. Clean Data Beats Large Data

A small dataset you trust will always outperform a large one you don’t — especially in audio.

2. Diffusion Preserves Semantics

Diffusion models add variation, not structure. That distinction matters.

3. YAMNet Matched the Domain Better

Environmental audio ≠ speech separation. Using the right prior mattered more than model size.


How This Strengthens Hoomanely

This shift directly improved Hoomanely’s platform:

  • Stronger privacy guarantees (no speech leakage learned)
  • Better edge stability
  • More reliable downstream anomaly detection
  • Faster iteration cycles (less debugging invisible errors)

It reinforced an internal principle we now follow strictly:

Never scale audio data you cannot confidently audit.

Key Takeaways

  • Transfer learning can fail due to data trust, not model choice
  • Audio separation transformers are powerful — but risky as teachers
  • Manual verification is harder in audio than vision
  • Clean seed data + diffusion augmentation scales better
  • YAMNet fine-tuning beat teacher–student distillation in our case

Final Thought

Transfer learning didn’t fail because the models were weak.

It failed because we trusted the wrong data source.

Once we treated audio like the fragile signal it is - and optimized for trust over scale - everything else fell into place.
