A comparison of three ML approaches — classical SVMs, recurrent networks (GRU), and CNNs — on UrbanSound8K: 8,732 clips of urban sounds across 10 classes, evaluated with 10-fold cross-validation. The goal was to understand which representations and architectures work best for audio, and why.
All results use F1-Macro rather than accuracy — the dataset is imbalanced by up to 2.7×, so accuracy would reward ignoring the rare classes.
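A toy illustration of why this matters (not dataset results): on an imbalanced label set, a classifier that simply ignores the rare class still scores high accuracy, while macro-F1 exposes the failure.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10  # 9:1 class imbalance
y_pred = [0] * 100            # always predict the majority class

print(accuracy_score(y_true, y_pred))            # 0.9 — looks fine
print(f1_score(y_true, y_pred, average="macro"))  # ~0.47 — reveals the ignored class
```

Macro averaging gives each class equal weight regardless of its frequency, so the rare class's F1 of zero drags the score down.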
Three audio representations are used throughout. A spectrogram is a 2D image of a sound clip — time on the x-axis, frequency on the y-axis, brightness encoding energy. A mel-spectrogram uses a logarithmic frequency scale that matches human hearing. MFCCs compress the mel-spectrogram into ~20–50 numbers per frame, capturing the rough shape of the spectrum while discarding fine detail. Full exploratory analysis of these representations is in the EDA notebooks on GitHub.
| Model | Best variant | F1-Macro |
|---|---|---|
| CNN | Mel-spectrogram + augmentation | 77.8% |
| SVM | MFCC + delta features | 70.0% |
| GRU | MFCC + delta, class-weighted | 66.1% |
All experiments ranked by F1-Macro — hover for exact values. Colours indicate model family.
SVM on MFCC — the classical baseline. Each audio clip is summarised as a vector of MFCCs (a compact description of the spectral shape), and an SVM is trained on those vectors. I ran a grid search over the number of coefficients and whether to add delta features (how fast the spectrum is changing). Delta features give a consistent +2% gain. The best SVM hits 70% F1 — a strong bar for deep models to beat.
SVM grid search — F1-Macro for each hyperparameter combination. Hover for values.
GRU on MFCC — instead of averaging over time, a recurrent network processes the MFCC sequence frame-by-frame. In theory it should capture temporal patterns the SVM misses. In practice it underperforms: urban sounds are defined by texture, not timing. The GRU overfits the training folds without learning generalisable patterns. More MFCC dimensions do not consistently help either. Full sweep and variant comparison on GitHub.
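A minimal sketch of this architecture, assuming PyTorch (layer sizes are illustrative): the MFCC sequence is read frame by frame and the final hidden state is classified.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_mfcc=20, hidden=64, n_classes=10):
        super().__init__()
        self.gru = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):        # x: (batch, time, n_mfcc)
        _, h = self.gru(x)       # h: (1, batch, hidden) — last hidden state
        return self.head(h[-1])  # logits: (batch, n_classes)

model = GRUClassifier()
logits = model(torch.randn(8, 173, 20))  # 8 clips, 173 MFCC frames each
print(logits.shape)                      # torch.Size([8, 10])
```

Compressing the whole sequence into one final hidden state is a plausible reason the model struggles here: a texture-like sound carries the same evidence in every frame, so the recurrence adds capacity (and overfitting risk) without adding information.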
CNN on spectrogram — audio clips are converted to 2D images (time × frequency) and fed into a standard CNN. I tested two frequency scales (linear and mel), plus three design choices: input normalisation, data augmentation, and class weighting. The mel scale and augmentation are the two biggest wins.
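A sketch of the CNN path, again assuming PyTorch, with a simple time- and frequency-masking augmentation in the spirit of SpecAugment. The architecture and mask width are illustrative, not the exact configuration tested.

```python
import torch
import torch.nn as nn

def mask_augment(mel, max_w=16):
    """Zero one random frequency band and one random time band (train-time only)."""
    mel = mel.clone()
    f0 = torch.randint(0, mel.shape[-2] - max_w, (1,)).item()
    t0 = torch.randint(0, mel.shape[-1] - max_w, (1,)).item()
    mel[..., f0:f0 + max_w, :] = 0.0
    mel[..., :, t0:t0 + max_w] = 0.0
    return mel

# Small CNN over one-channel spectrogram "images".
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

x = mask_augment(torch.randn(4, 1, 128, 173))  # batch of 128-band mel-spectrograms
print(cnn(x).shape)                            # torch.Size([4, 10])
```

Masking forces the network not to rely on any single frequency band or moment in time, which suits texture-defined classes where evidence is spread across the whole clip.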
Ablation — isolated effect of each design choice. Mel-spectrogram (top row), raw spectrogram (bottom row).
air_conditioner ↔ engine_idling is unsolved by every model — both are continuous low-frequency drones with nearly identical spectrograms. The difference lies in subtle periodicity patterns that 4-second clips don't reliably expose. car_horn and gun_shot (rare, impulsive classes) are where the CNN most clearly beats the GRU: a brief broadband burst shows up as a sharp vertical stripe in the mel-spectrogram, a pattern CNNs detect naturally.
Normalised confusion matrices for the best variant of each family. Brighter diagonal = better. Hover for values.
Full code, training logs, and analysis notebook on GitHub.