Why Transfer Learning Failed For Our Audio Noise Cancellation Pipeline
Introduction
Transfer learning usually feels like a shortcut that just works. Pretrain on a large dataset, fine-tune on your task, and let scale do the heavy lifting. That approach worked well for us in vision - but audio broke that assumption in a very specific, painful way.
At Hoomanely, we were trying to build a privacy-preserving dog audio pipeline: detect eating sounds, suppress human speech, and later enable anomaly detection — all on edge devices. We relied heavily on a prompt-based audio separation transformer as a teacher model, assuming its outputs would be “good enough” to fine-tune downstream models.
They weren’t.
This post explains why transfer learning failed in our audio setup, how a seemingly small 20% of noisy outputs completely derailed training, and why a smaller, cleaner dataset + diffusion-based augmentation + YAMNet fine-tuning outperformed teacher–student distillation by a wide margin.
Context: Audio at Hoomanely
Hoomanely’s mission is to convert raw pet signals into daily, actionable intelligence - responsibly and privately. Audio is one of the hardest modalities in that equation.
Our constraints shaped everything:
- Always-on microphones in homes
- Strict privacy requirements (human speech must not leak)
- Edge inference (limited compute, low latency)
- Extremely subtle target signals (chewing, licking, bowl movement)
Unlike vision errors, audio errors are invisible. You don’t see what went wrong — you hear it, and humans are notoriously bad at auditing thousands of clips.
The Original Plan
The original pipeline looked reasonable:
- Collect raw bowl audio
- Use an audio separation transformer with prompts like “dog eating, chewing, drinking”
- Treat the separated output as clean ground truth
- Fine-tune downstream models (Conv-TasNet / classifiers)
- Deploy on edge
This was essentially teacher–student distillation, with the transformer acting as a high-capacity oracle.
On paper, it made sense.
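For concreteness, here is a minimal sketch of that distillation loop. The `teacher` and `student` objects stand in for the prompt-based separation transformer and a downstream Conv-TasNet-style model; the call signatures and hyperparameters are illustrative, not our production code.

```python
import torch

# Hypothetical sketch of the original teacher-student setup.
# `teacher` stands in for the prompt-based separation transformer,
# `student` for the downstream model (e.g. a Conv-TasNet variant);
# the interfaces are illustrative only.

def build_pseudo_labels(teacher, mixtures, prompt="dog eating, chewing, drinking"):
    """Run the teacher once per raw clip and keep its output as 'clean' ground truth."""
    with torch.no_grad():
        return [teacher(mix, prompt=prompt) for mix in mixtures]

def distill(student, mixtures, pseudo_targets, epochs=10, lr=1e-4):
    """Fine-tune the student to reproduce the teacher's separations."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()  # waveform-domain loss, illustrative choice
    for _ in range(epochs):
        for mix, target in zip(mixtures, pseudo_targets):
            opt.zero_grad()
            loss = loss_fn(student(mix), target)
            loss.backward()
            opt.step()
    return student
```

The assumption baked into `build_pseudo_labels` is the one that failed: that the teacher’s output is trustworthy enough to serve as ground truth.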
Where Things Started Breaking
The 20% Problem
Roughly 20% of the transformer outputs were bad.
Not catastrophically bad - just subtly wrong:
- Residual human speech buried under chewing
- Over-suppressed transients
- Artificial gating artifacts
- Frequency smearing that didn’t exist in real data
The issue wasn’t the percentage.
The issue was verification.
Why Audio Is Harder to Verify Than Vision
In vision pipelines:
- A bad label is visible instantly
- You can scroll through 500 images in minutes
- Annotation errors are obvious
In audio:
- Each clip takes 5–10 seconds
- Artifacts are subtle
- Fatigue sets in quickly
- Humans disagree on what “clean” means
We could not reliably filter that 20% out.
And the model definitely learned from it.

Transformer artifacts often look plausible but distort critical frequency patterns.
Why Transfer Learning Failed (Specifically Here)
1. Teacher Errors Became Student Bias
Teacher–student learning assumes:
“Teacher errors are rare and random.”
Ours were systematic.
The model learned:
- Transformer artifacts as “normal”
- Speech leakage as acceptable background
- Over-smoothed chewing as the target texture
The student didn’t generalize — it overfit the teacher’s mistakes.
2. AudioSep Outputs Were Not Auditable at Scale
We couldn’t:
- Confidently label good vs bad outputs
- Assign trust scores
- Weight samples correctly
So every epoch reinforced noise.
In hindsight, this wasn’t a modeling failure - it was a data trust failure.
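To make “weight samples correctly” concrete, here is the kind of per-sample trust weighting the distillation loss would have needed. The trust scores are exactly what we could not produce reliably, so everything below is illustrative.

```python
import torch

def weighted_distillation_loss(student_out, teacher_out, trust):
    """Per-clip L1 loss, down-weighted by how much we trust each teacher output.

    student_out, teacher_out: (batch, samples) waveforms
    trust: (batch,) scores in [0, 1], e.g. 1.0 = verified clean, 0.0 = discard
    """
    per_clip = torch.mean(torch.abs(student_out - teacher_out), dim=-1)
    return torch.sum(trust * per_clip) / (torch.sum(trust) + 1e-8)
```

Without a way to fill in `trust`, every clip was weighted equally — including the 20% we should have discarded.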
3. Distillation Multiplied the Damage
Teacher–student setups amplify bias:
- Teacher mistakes → student representations
- Student trained longer → stronger bias
- Downstream tasks inherit both
Each iteration moved us further from real dog audio.
The Pivot: Smaller, Cleaner, Verifiable Data
Instead of asking “How do we fix the model?”, we asked:
“What data can we actually trust?”
The new approach
- Manually curate a small, clean audio dataset
- Fewer samples, but 100% verified
- Real recordings, no separation artifacts
This dataset was tiny compared to our generated corpus — but it was honest.
Data Strategy Shift
| Old Strategy | New Strategy |
|---|---|
| Large synthetic dataset | Small real dataset |
| Transformer-generated targets | Human-verified audio |
| Teacher–student learning | Direct supervised learning |
| Hard to audit | Easy to trust |
This change alone stabilized training.
Augmenting with Diffusion Models (The Right Way)
The obvious concern was scale.
So instead of trusting separation transformers again, we used audio diffusion models to augment the data, not to generate labels.
Key difference:
- Conditioned on clean dog audio
- Used for variation, not truth
- Human-verified seed data only
Augmentations included:
- Temporal stretching
- Subtle pitch variance
- Environmental noise layering
- Room impulse responses
Crucially, the augmented clips were still recognizably the same sounds.
Diffusion adds variation without inventing structure.
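Here is a minimal sketch of those classical augmentations applied to a verified seed clip, using librosa and numpy. The diffusion-conditioned variation step is not shown, and every parameter range is illustrative.

```python
import numpy as np
import librosa

def augment_clip(y, sr, noise=None, rir=None, rng=None):
    """Apply gentle, label-preserving augmentations to a clean clip `y` at sample rate `sr`."""
    rng = rng or np.random.default_rng()

    # Temporal stretching: small speed changes that keep the chewing texture intact.
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))

    # Subtle pitch variance: well under a semitone, so the sound stays recognizable.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-0.5, 0.5))

    # Environmental noise layering at a low level.
    if noise is not None:
        noise = np.resize(noise, y.shape)
        y = y + 0.05 * noise / (np.max(np.abs(noise)) + 1e-8)

    # Room impulse response: convolve to simulate a different acoustic space.
    if rir is not None:
        y = np.convolve(y, rir)[: len(y)]

    # Renormalize to avoid clipping.
    return y / (np.max(np.abs(y)) + 1e-8)
```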
Fine-Tuning YAMNet (Why It Worked)
With clean + augmented data, we fine-tuned YAMNet for dog-specific sounds.
Why YAMNet?
- Lightweight
- Strong general audio priors
- Stable embeddings
- Designed for environmental audio (not speech separation)
We fine-tuned only the upper layers, keeping lower representations intact.
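As a rough sketch of what “fine-tune only the upper layers” can look like in practice: one common pattern is to load YAMNet from TensorFlow Hub as a frozen embedding extractor and train a small classification head on top of its 1024-dimensional embeddings. The class list and head architecture below are illustrative, not our exact production configuration.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Frozen YAMNet backbone: expects mono float32 audio at 16 kHz.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def clip_embedding(waveform_16k_mono):
    # YAMNet returns (scores, embeddings, log-mel spectrogram);
    # keep only the per-frame 1024-dim embeddings and pool them.
    _, embeddings, _ = yamnet(waveform_16k_mono)
    return tf.reduce_mean(embeddings, axis=0)

# Trainable head for dog-specific classes -- the only part we update.
num_classes = 4  # e.g. chewing, licking, bowl movement, other (illustrative)
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
head.compile(optimizer="adam",
             loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
```

Keeping the backbone frozen preserves YAMNet’s general audio priors while the head adapts to the dog-specific label space.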
Results vs Teacher–Student Distillation
The results were unambiguous:
- Higher classification accuracy
- Lower false positives
- No speech hallucinations
- More stable embeddings
- Better edge performance
Most importantly, this setup outperformed the AudioSep-based teacher–student pipeline — despite using far less data.

Clean data + diffusion augmentation beat noisy distillation.
Why This Worked
1. Clean Data Beats Large Data
A small dataset you trust will always outperform a large one you don’t — especially in audio.
2. Diffusion Preserves Semantics
Diffusion models add variation, not structure. That distinction matters.
3. YAMNet Matched the Domain Better
Environmental audio ≠ speech separation. Using the right prior mattered more than model size.
How This Strengthens Hoomanely
This shift directly improved Hoomanely’s platform:
- Stronger privacy guarantees (no speech leakage learned)
- Better edge stability
- More reliable downstream anomaly detection
- Faster iteration cycles (less time spent debugging invisible errors)
It reinforced an internal principle we now follow strictly:
Never scale audio data you cannot confidently audit.
Key Takeaways
- Transfer learning can fail due to data trust, not model choice
- Audio separation transformers are powerful — but risky as teachers
- Manual verification is harder in audio than vision
- Clean seed data + diffusion augmentation scales better
- YAMNet fine-tuning beat teacher–student distillation in our case
Final Thought
Transfer learning didn’t fail because the models were weak.
It failed because we trusted the wrong data source.
Once we treated audio like the fragile signal it is - and optimized for trust over scale - everything else fell into place.