When MelCNN Hallucinates: Why Clean Spectrograms Still Produce Wrong Sounds

How MelCNN learned “ghost sounds” from mislabeled silence


Introduction

Hallucination is usually discussed in the context of large language models—confident answers to questions that were never asked, or facts that never existed. But hallucination isn’t exclusive to text models. In audio systems, especially CNN‑based classifiers like MelCNN, hallucination shows up in a quieter, more dangerous form: high‑confidence predictions for sounds that never happened.

We ran into this exact failure mode while deploying MelCNN for indoor dog‑audio classification. The spectrograms looked clean. The pipeline was stable. Offline metrics were acceptable. And yet, in production, the model kept firing on silence, fans, distant traffic hum, and room reverberations—confidently labeling them as meaningful sound events.

This post breaks down why MelCNN hallucinated, how mislabeled silence became “ghost sounds,” and what actually worked to fix it—beyond just adding more data.


The Problem: Clean Spectrograms, Wrong Sounds

At first glance, nothing looked wrong:

  • Audio was preprocessed correctly
  • Mel spectrograms were normalized
  • Training loss converged smoothly
  • Validation accuracy looked fine

But when we inspected false positives, a pattern emerged:

Most hallucinations happened on low‑information audio—near silence, steady background noise, or transitional frames with faint energy.

The model wasn’t confused.

It was overconfident.

The core issue

MelCNN wasn’t hallucinating randomly. It was responding to learned correlations that shouldn’t have existed.

And those correlations came from the dataset.


How Silence Became a Class (Accidentally)

1. Labels around silence were unreliable

The boundary between “event” and “nothing” was contaminated on both sides. Our training data included:

  • Truncated clips with leading or trailing silence
  • Indoor ambient noise recorded during real events
  • Low‑energy audio mistakenly labeled as positive

To a human listener, these sounded like nothing.

To a CNN, they had structure.


2. Mel spectrograms compress away semantics

Mel spectrograms:

  • Emphasize energy patterns
  • Compress frequency resolution
  • Lose phase and temporal nuance

This means very different sounds can look similar once reduced to mel space—especially at low energy.

Silence plus a faint hum can look closer to a real event than you’d expect.
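To make that concrete, here is a minimal sketch (using librosa; the sample rate, mel settings, and synthetic hum are illustrative, not our production values) of how a near‑silent clip still turns into a structured log‑mel image once it’s been per‑clip normalized:

import numpy as np
import librosa

# Illustrative only: one second of "near silence" = faint hum + mic noise floor.
sr = 16000
t = np.arange(sr) / sr
hum = 0.01 * np.sin(2 * np.pi * 60 * t)       # faint low-frequency room hum
noise = 0.005 * np.random.randn(sr)           # mic noise floor
near_silence = (hum + noise).astype(np.float32)

mel = librosa.feature.melspectrogram(y=near_silence, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Per-clip normalization stretches this faint structure to full contrast,
# so the CNN downstream sees a confident-looking "energy blob".
normalized = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
print(normalized.shape, normalized.min(), normalized.max())

The exact numbers don’t matter; the point is that “nothing” still comes out of the front end as a structured image.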


3. CNNs reward consistency, not meaning

MelCNN learned:

“Whenever I see this energy blob, I get rewarded.”

It didn’t learn what the sound was.

It learned what patterns paid off during training.

That’s where hallucination begins.


Detecting the Hallucination: False Positives Tell the Truth

The breakthrough didn’t come from metrics—it came from false positive mining.

We took every false positive from real usage and asked:

  • Where does this sit in embedding space?
  • Does it form a cluster?
  • What does it sound like?

What we found

  • False positives were not scattered
  • They formed tight, repeatable clusters
  • Many clusters were dominated by:
    • Silence + noise
    • HVAC / fan sounds
    • Low‑frequency room hum
    • Transitional audio boundaries

This confirmed the model wasn’t unstable—it was consistently hallucinating the same ghost sounds.
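The clustering itself doesn’t need anything exotic. Here’s a hedged sketch, assuming you’ve already extracted one embedding per false‑positive clip from the model’s penultimate layer; the file name and the DBSCAN parameters are illustrative:

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import DBSCAN

# embeddings: (num_false_positives, embedding_dim), one row per logged FP clip.
embeddings = np.load("false_positive_embeddings.npy")  # hypothetical file
embeddings = normalize(embeddings)  # L2-normalize so distances behave like cosine

# Density-based clustering: repeatable hallucinations show up as large clusters,
# one-off mistakes land in the noise label (-1).
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(embeddings)

for cluster_id in sorted(set(labels)):
    size = int(np.sum(labels == cluster_id))
    tag = "noise" if cluster_id == -1 else f"cluster {cluster_id}"
    print(f"{tag}: {size} clips")
# Listen to a handful of clips per cluster to name its theme
# (silence + noise, HVAC, room hum, clip boundaries, ...).

DBSCAN’s noise label is convenient here: scattered one‑off errors fall out automatically, and only the tight, repeatable clusters are worth a listening pass.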


Fix #1: Explicit False Positive Mining

Instead of treating false positives as errors, we treated them as new data sources.

What we did

  • Logged high‑confidence false positives from production
  • Clustered them using embeddings
  • Manually verified cluster themes (not every clip)
  • Promoted clusters into:
    • neg_hard
    • neg_silence
    • neg_ambient

This immediately changed the training dynamics.

The model was no longer rewarded for recognizing “almost sounds.”
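The promotion step can be as boring as a mapping from verified cluster themes to negative buckets, plus a manifest the training pipeline reads. A sketch under those assumptions (the cluster IDs, folder layout, and manifest format are all illustrative):

import csv
import shutil
from pathlib import Path

# Verified by listening to a few clips per cluster; IDs and themes are examples.
CLUSTER_THEMES = {
    0: "neg_silence",   # silence + mic noise floor
    1: "neg_ambient",   # HVAC / fan hum
    2: "neg_hard",      # clip boundaries, faint distant traffic
}

def promote(cluster_assignments, clip_paths, out_dir="train/negatives"):
    """Copy verified false-positive clips into per-theme negative folders
    and append them to a manifest the training pipeline can ingest."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "manifest.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for path, cluster_id in zip(clip_paths, cluster_assignments):
            theme = CLUSTER_THEMES.get(cluster_id)
            if theme is None:          # unverified or noise cluster: skip
                continue
            dest = out / theme / Path(path).name
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(path, dest)
            writer.writerow([str(dest), theme])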


Fix #2: Silence Is a First‑Class Concept

One subtle but critical change:

We stopped treating silence as absence of class.

Instead:

  • Silence became an explicit negative concept
  • We added multiple types of silence:
    • Clean silence
    • Noisy silence
    • Transitional silence
    • Device‑specific silence (mic noise floors)

This reduced ambiguity in the decision boundary.

The model learned:

“Low energy doesn’t imply a weak event—it usually means no event.”
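Concretely, this meant giving silence its own entries in the label space instead of lumping everything into a single background bucket. A minimal sketch of what such a label map can look like (the class names are illustrative, not our exact taxonomy):

# Illustrative label map: silence is modeled, not treated as "absence of class".
LABELS = [
    "bark",                    # positive event classes (examples)
    "whine",
    "scratching",
    "neg_silence_clean",       # quiet room, nothing happening
    "neg_silence_noisy",       # noise floor + faint hum
    "neg_silence_transition",  # boundaries between events
    "neg_silence_device",      # per-microphone noise floors
    "neg_ambient",             # HVAC, fans, distant traffic
]
LABEL_TO_INDEX = {name: i for i, name in enumerate(LABELS)}

# At inference time anything in a neg_* class is reported as "no event",
# but during training the model has to tell these apart from faint real events.
NEGATIVE_CLASSES = {name for name in LABELS if name.startswith("neg_")}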

Fix #3: Energy‑Aware Training Signals

MelCNN was overly sensitive to faint activations. So we added energy‑aware constraints:

  • Ignored predictions below a minimum RMS / energy threshold
  • Penalized confident predictions on low‑energy frames
  • Augmented training with:
    • Gain jitter
    • Background‑only mixes
    • Artificial dropouts

This discouraged the model from treating noise as signal.
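Here’s a hedged PyTorch sketch of those pieces: an inference‑time RMS gate, a loss penalty on event‑class confidence for low‑energy inputs, and a simple gain‑jitter augmentation. The thresholds, weights, and the assumption that event classes occupy the first indices are illustrative, not our tuned setup:

import torch
import torch.nn.functional as F

RMS_FLOOR = 0.01          # illustrative threshold; tune per device noise floor
LOW_ENERGY_PENALTY = 0.5  # illustrative weight on the confidence penalty
EVENT_CLASSES = 3         # assumption: event classes come first in the label map

def rms(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (batch, samples)
    return waveform.pow(2).mean(dim=-1).sqrt()

def gated_predict(model, waveform, mel):
    """Inference-time gate: never report an event for clips below the energy floor."""
    with torch.no_grad():
        probs = F.softmax(model(mel), dim=-1)
    quiet = rms(waveform) < RMS_FLOOR
    probs[quiet] = 0.0  # quiet clips are "no event", whatever the model says
    return probs

def energy_aware_loss(logits, targets, waveform):
    """Cross-entropy plus a penalty on event-class confidence for quiet inputs."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    event_conf = probs[:, :EVENT_CLASSES].sum(dim=-1)   # probability mass on events
    quiet = (rms(waveform) < RMS_FLOOR).float()
    penalty = (event_conf * quiet).mean()
    return ce + LOW_ENERGY_PENALTY * penalty

def gain_jitter(waveform, low_db=-6.0, high_db=6.0):
    """Random gain so absolute level alone can't predict the class."""
    gain_db = torch.empty(waveform.shape[0], 1).uniform_(low_db, high_db)
    return waveform * (10.0 ** (gain_db / 20.0))

The gate is deliberately dumb: it runs outside the model, so it keeps working even after the classifier is retrained.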


What Improved (And What Didn’t)

Improved significantly

  • False positives on silence and ambient noise
  • Prediction stability across environments
  • Trust in high‑confidence outputs

Did NOT help much

  • Simply adding more random negatives
  • Increasing model depth
  • Training longer

Hallucination wasn’t a capacity problem.

It was a data semantics problem.


Key Takeaways

  • Audio hallucination looks like confidence, not randomness
  • Clean spectrograms don’t guarantee clean labels
  • Silence must be modeled, not ignored
  • False positives are the best dataset you already have
  • Simple guardrails (like energy gating) beat bigger classifiers in production systems

MelCNN didn’t fail because it was weak.

It failed because it learned exactly what we taught it, including the ghosts hidden in our data.
