Visualizing the Unknown: Anomaly Detection in Unlabeled Audio using UMAP


The "Cold Start" Problem in Audio AI

In the lifecycle of any machine learning project, the most painful phase is often the very first one: Data Curation. Before you can train a model to detect a dog bark, you need thousands of labeled examples of dog barks. But where do those labels come from?

In the Computer Vision domain, this "Cold Start" problem is manageable. If you scrape 10,000 images from the web, you can open a file explorer, set the view to "Large Icons," and visually scan the dataset. The human brain is a highly parallel image processor; you can spot a corrupted image, a blank file, or an irrelevant photo in milliseconds. A single engineer can sanity-check thousands of images in an hour.

Teams working with audio do not have this luxury.

Audio is inherently temporal and opaque. You cannot "glance" at an MP3 file. To understand its content, you must consume it linearly. If you have 10,000 files averaging 5 seconds each, that is nearly 14 hours of continuous listening time. That means 14 hours just to know what data you have, let alone label it.

At Hoomanely, we are building advanced AI solutions to provide preventive healthcare to pet parents. To do this robustly, we rely on large-scale, real-world audio data. We faced a specific, critical bottleneck: we had gathered a massive dataset of "potential" pet sounds, but we knew it was polluted with noise—TV commercials, traffic sounds, silence, and human speech. We needed a way to audit this data that didn't involve a week of manual listening.

We needed to turn the temporal problem of listening into a spatial problem of looking.


The Challenge: Visualising the Invisible

Why is audio so hard to visualise?

The Waveform Trap

The most basic representation of audio—the waveform—is simply a plot of amplitude over time. While it shows loudness, it is semantically empty. A recording of a dog barking and a recording of a car horn might both manifest as loud, jagged bursts of amplitude. Visually, they look identical.

The Spectrogram Hurdle

We often convert audio to spectrograms (visual representations of frequency over time) to train models. While spectrograms contain the information needed to distinguish a bark from a horn, they are still complex, high-dimensional images. Detailed patterns in the 200Hz-500Hz range might distinguish a "playful bark" from an "aggressive growl," but these subtleties are invisible to the naked eye of a human reviewer scrolling through thousands of files.

The Dimensionality Curse

To a computer, a 5-second audio clip at 44.1kHz is a vector of roughly 220,000 numbers. To make sense of this dataset, we need to compress those 220,000 dimensions down to exactly two: X and Y. If we can do that effectively, we can map our entire audio library onto a 2D scatter plot.
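As a quick sanity check of that arithmetic, loading a single clip with librosa shows the dimensionality directly (the file path is illustrative):

```python
# Load a 5-second clip at 44.1 kHz and inspect its raw dimensionality.
import librosa

waveform, sample_rate = librosa.load("dataset/example.wav", sr=44100, duration=5.0)
print(waveform.shape)  # (220500,) -> 44,100 samples/second x 5 seconds
```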

The goal was simple but ambitious: Proximity in the plot must equal semantic similarity in the real world.


The Landscape of Solutions

We evaluated three primary techniques for visualising this high-dimensional structure. Here is how they stacked up for our specific use case of audio anomaly detection.

A. PCA (Principal Component Analysis)

The "Old Guard" of dimensionality reduction. PCA works by rotating the dataset in high-dimensional space to find the axes (Principal Components) that capture the most variance (spread) in the data.

  • How it works: It flattens the data onto the 2D plane that preserves the most "information" (variance).
  • The Problem: PCA assumes that the data lies on a linear plane (like a sheet of paper). However, complex real-world data like audio embeddings usually sits on a non-linear manifold (like a crumpled ball of paper). When PCA tries to flatten this crumpled ball, it smashes distinct layers together. In practice, our PCA plots just looked like one giant, amorphous blob. We couldn't tell the signal from the noise.
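For reference, the flattening PCA performs is a two-liner with scikit-learn; the random matrix below is only a stand-in for the real audio embeddings described in Step 1 further down:

```python
# Linear projection of high-dimensional vectors down to 2D with PCA.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(10_000, 1024)  # placeholder for real audio embeddings
pca_coords = PCA(n_components=2).fit_transform(embeddings)  # (10000, 2) linear "shadow"
```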

B. t-SNE (t-Distributed Stochastic Neighbour Embedding)

For the last decade, t-SNE has been the standard for visualising high-dimensional data in academic papers.

  • How it works: It calculates the probability of two points being neighbors in high-dimensional space and tries to match that probability in 2D space. It is incredibly good at keeping similar things together.
  • The Problem:
    1. It is Slow: Standard t-SNE scales poorly ($O(N^2)$) with large datasets.
    2. The "Global Structure" Issue: t-SNE focuses so obsessively on local neighbors that it often loses the big picture. Two clusters might appear on opposite sides of the plot, but that doesn't necessarily mean they are very different—it just means t-SNE put them there to make space. This unreliability made it dangerous for anomaly detection. We need to know if an outlier is truly an outlier, or just misplaced.

C. UMAP (Uniform Manifold Approximation and Projection)

The modern solution. UMAP is backed by rigorous algebraic topology but offers a practical breakthrough.

  • How it works: Like t-SNE, it constructs a nearest-neighbour graph of the data. However, it uses a simpler mathematical foundation (fuzzy simplicial sets) that allows it to optimise the layout much faster.
  • Why we chose it:
    • Speed: It runs significantly faster than t-SNE, allowing us to iterate quickly.
    • Balance: It is the "Goldilocks" solution. It captures the tight local clusters of t-SNE (grouping all "barks" together) while preserving the global structure of PCA (keeping "animal sounds" far away from "machine noise").

Our Approach: The Manifold Pipeline

To solve our data quality/curation problem, we built a specific pipeline combining Deep Learning feature extraction with UMAP visualisation.

Step 1: Semantic Vectorisation (The "Ear")

You cannot usefully feed raw audio waveforms into UMAP; the raw samples are too high-dimensional and carry no semantic structure. We needed a translator. We utilized a Pre-trained Audio Encoder. Models like VGGish, CLAP, or AudioMAE have been trained on millions of clips of real-world sound (datasets like AudioSet) to understand audio. We stripped off the final "classification" layer of the model and kept the penultimate "embedding" layer. By passing our raw audio through this model, we converted each file into a dense 1024-dimensional vector (a sketch of this step follows below).

  • Insight: This vector is a semantic summary. The model compresses the texture of the sound—the timbre, the pitch, the rhythm—into a set of numbers.
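Here is a minimal sketch of this step, assuming the community VGGish port exposed via torch.hub; the encoder we actually used (and its 1024-dimensional output) differ, but the pattern is identical:

```python
# Step 1 sketch: convert each audio file into one fixed-length semantic vector.
# Assumes the "harritaylor/torchvggish" torch.hub port of VGGish (128-d frames).
from pathlib import Path

import numpy as np
import torch

encoder = torch.hub.load("harritaylor/torchvggish", "vggish")
encoder.eval()

def embed_clip(wav_path: str) -> np.ndarray:
    """Return one vector per clip by averaging the encoder's ~1 s frame embeddings."""
    with torch.no_grad():
        frames = encoder.forward(wav_path)         # (num_frames, 128)
    return frames.float().mean(dim=0).cpu().numpy()  # (128,) per-clip summary

wav_files = sorted(Path("dataset/").glob("*.wav"))
embeddings = np.stack([embed_clip(str(p)) for p in wav_files])  # (N, 128)
np.save("embeddings.npy", embeddings)
```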

Step 2: Manifold Projection (The "Map")

We fed these vectors into UMAP. We found that hyperparameter tuning was critical:

  • n_neighbors=15: A smaller number here helped us fragment the data into many small, tight micro-clusters rather than one big continent.
  • min_dist=0.1: This allowed points to pack tightly together, creating clear empty space (separation) between different types of sounds.
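A minimal sketch of this step with the umap-learn package; the metric and random seed below are our illustrative choices rather than prescriptions:

```python
# Step 2 sketch: project the (N, D) embedding matrix down to a 2D map with UMAP.
import numpy as np
import umap

embeddings = np.load("embeddings.npy")              # (N, D) matrix from Step 1
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1,
                    n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)          # (N, 2) map coordinates
np.save("coords.npy", coords)
```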

Step 3: Interactive Exploration (The "Tool")

The final piece of the puzzle was Interactivity. A static image of dots is interesting, but not actionable. You can't debug what you can't check. We built a lightweight web-based visualisation tool (using libraries like Plotly or Bokeh) where each point on the scatter plot was linked to the underlying audio file.

  • Hover: See the filename and metadata.
  • Click: Play the audio sample.
  • Lasso: Select a group of points and export their IDs to a CSV.
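A minimal sketch of the plotting side using Plotly (file paths and column names are illustrative):

```python
# Step 3 sketch: an interactive scatter plot where every dot is one audio file.
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px

coords = np.load("coords.npy")
wav_files = sorted(Path("dataset/").glob("*.wav"))
df = pd.DataFrame({"x": coords[:, 0], "y": coords[:, 1],
                   "filename": [p.name for p in wav_files]})

fig = px.scatter(df, x="x", y="y", hover_name="filename",
                 title="UMAP map of the audio dataset")
fig.update_traces(marker=dict(size=4, opacity=0.7))
fig.write_html("audio_map.html")  # open in a browser; hover and lasso work out of the box
```

Hover and lasso selection work in the exported HTML as-is; click-to-play and exporting a lasso selection to CSV need a small callback layer on top, for example a Dash or Bokeh server app.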

Results: Cleaning the Dataset

The moment we loaded our data into this tool, the "Unknown" became "Known". The results were glaringly obvious.

We saw our data form distinct "Islands":

  1. The Main Landmass: Most of the data formed a large, connected continent. Clicking around revealed this to be our high-quality pet sounds.
  2. The "Silence" Island: A tight, dense pinprick of points far away from everything else. Inspection revealed these were files where the audio trigger failed, resulting in 5 seconds of digital zero.
  3. The "Glitch" Cluster: A stringy, scattered group of points. These turned out to be corrupted audio files that sounded like static or screeching.
  4. The "Human" Peninsula: A cluster attached to the main landmass but extending outwards. These were interesting—they were clips of pets, but with owners talking loudly over them.

The Action: Instead of listening to 10,000 files, we simply lasso-selected the Silence Island and the Glitch Cluster and deleted them. In 10 minutes of visual work, we removed a significant portion of garbage from our dataset.
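A minimal sketch of that cleanup, assuming the lasso export produced a "to_remove.csv" with a filename column (our convention here, not a library feature):

```python
# Quarantine every file flagged by the lasso selection instead of deleting outright.
from pathlib import Path

import pandas as pd

flagged = pd.read_csv("to_remove.csv")["filename"]
quarantine = Path("quarantine/")
quarantine.mkdir(exist_ok=True)

for name in flagged:
    src = Path("dataset/") / name
    if src.exists():
        src.rename(quarantine / name)  # reversible: move rather than hard-delete
```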


Key Takeaways

  1. Data Curation is Engineering: Don't outsource data cleaning. Building tools to visualise your data is high-leverage engineering work.
  2. Embeddings are Universal: You don't need labeled classes to start using Deep Learning. Use pre-trained embeddings to structure your chaotic data before you start your own training.
  3. Visual Debugging > Manual Auditing: Whenever possible, transform a linear, manual task (listening/reading) into a parallel, visual task (clustering). The human eye is the fastest outlier detector in the world—give it a map to look at.

The best model architecture in the world cannot fix a dataset full of noise. See your data first, train second.
