A comparison of three ML approaches — classical SVMs, recurrent networks (GRU), and CNNs — on UrbanSound8K: 8,732 clips of urban sounds across 10 classes, evaluated with 10-fold cross-validation. The goal was to understand which representations and architectures work best for audio, and why.
All results use F1-Macro rather than accuracy — the dataset is imbalanced by up to 2.7×, so accuracy would reward ignoring the rare classes.
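A toy illustration of why this matters (not dataset results): on an imbalanced label set, a classifier that simply ignores the rare class still scores high accuracy, while macro-F1 exposes the failure.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10  # 9:1 class imbalance
y_pred = [0] * 100            # always predict the majority class

print(accuracy_score(y_true, y_pred))            # 0.9 — looks fine
print(f1_score(y_true, y_pred, average="macro"))  # ~0.47 — reveals the ignored class
```

Macro averaging gives each class equal weight regardless of its frequency, so the rare class's F1 of zero drags the score down.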
Three audio representations are used throughout. A spectrogram is a 2D image of a sound clip — time on the x-axis, frequency on the y-axis, brightness encoding energy. A mel-spectrogram uses a logarithmic frequency scale that matches human hearing. MFCCs compress the mel-spectrogram into ~20–50 numbers per frame, capturing the rough shape of the spectrum while discarding fine detail. Full exploratory analysis of these representations is in the EDA notebooks on GitHub.
| Model | Best variant | F1-Macro |
|---|---|---|
| CNN | Mel-spectrogram + augmentation | 77.8% |
| SVM | MFCC + delta features | 70.0% |
| GRU | MFCC + delta, class-weighted | 66.1% |
All experiments ranked by F1-Macro — hover for exact values. Colours indicate model family.
SVM on MFCC — the classical baseline. Each audio clip is summarised as a vector of MFCCs (a compact description of the spectral shape), and an SVM is trained on those vectors. I ran a grid search over the number of coefficients and whether to add delta features (how fast the spectrum is changing). Delta features give a consistent +2% gain. The best SVM hits 70% F1 — a strong bar for deep models to beat.
SVM grid search — F1-Macro for each hyperparameter combination. Hover for values.
GRU on MFCC — instead of averaging over time, a recurrent network processes the MFCC sequence frame-by-frame. In theory it should capture temporal patterns the SVM misses. In practice it underperforms: urban sounds are defined by texture, not timing. The GRU overfits the training folds without learning generalisable patterns. More MFCC dimensions do not consistently help either. Full sweep and variant comparison on GitHub.
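A minimal sketch of this architecture, assuming PyTorch (layer sizes are illustrative): the MFCC sequence is read frame by frame and the final hidden state is classified.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_mfcc=20, hidden=64, n_classes=10):
        super().__init__()
        self.gru = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):        # x: (batch, time, n_mfcc)
        _, h = self.gru(x)       # h: (1, batch, hidden) — last hidden state
        return self.head(h[-1])  # logits: (batch, n_classes)

model = GRUClassifier()
logits = model(torch.randn(8, 173, 20))  # 8 clips, 173 MFCC frames each
print(logits.shape)                      # torch.Size([8, 10])
```

Compressing the whole sequence into one final hidden state is a plausible reason the model struggles here: a texture-like sound carries the same evidence in every frame, so the recurrence adds capacity (and overfitting risk) without adding information.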
CNN on spectrogram — audio clips are converted to 2D images (time × frequency) and fed into a standard CNN. I tested two frequency scales (linear and mel), plus three design choices: input normalisation, data augmentation, and class weighting. The mel scale and augmentation are the two biggest wins.
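A sketch of the CNN path, again assuming PyTorch, with a simple time- and frequency-masking augmentation in the spirit of SpecAugment. The architecture and mask width are illustrative, not the exact configuration tested.

```python
import torch
import torch.nn as nn

def mask_augment(mel, max_w=16):
    """Zero one random frequency band and one random time band (train-time only)."""
    mel = mel.clone()
    f0 = torch.randint(0, mel.shape[-2] - max_w, (1,)).item()
    t0 = torch.randint(0, mel.shape[-1] - max_w, (1,)).item()
    mel[..., f0:f0 + max_w, :] = 0.0
    mel[..., :, t0:t0 + max_w] = 0.0
    return mel

# Small CNN over one-channel spectrogram "images".
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

x = mask_augment(torch.randn(4, 1, 128, 173))  # batch of 128-band mel-spectrograms
print(cnn(x).shape)                            # torch.Size([4, 10])
```

Masking forces the network not to rely on any single frequency band or moment in time, which suits texture-defined classes where evidence is spread across the whole clip.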
Ablation — isolated effect of each design choice. Mel-spectrogram (top row), raw spectrogram (bottom row).
air_conditioner ↔ engine_idling is unsolved by every model — both are continuous low-frequency drones with nearly identical spectrograms. The difference lies in subtle periodicity patterns that 4-second clips don't reliably expose. car_horn and gun_shot (rare, impulsive classes) are where the CNN most clearly beats the GRU: a brief broadband burst shows up as a sharp vertical stripe in the mel-spectrogram, a pattern CNNs detect naturally.
Normalised confusion matrices for the best variant of each family. Brighter diagonal = better. Hover for values.
Full code, training logs, and analysis notebook on GitHub.