On-the-fly audio processing using nnAudio

Dorien Herremans
8 min read · Jul 29, 2022

A step-by-step tutorial on using nnAudio to tackle the keyword spotting (KWS) task

Have you heard of nnAudio? It is a PyTorch tool for audio processing on the GPU.

I want to introduce a very handy trainable front-end tool to all of you — nnAudio [1]. After providing some background, we will dive into the coding tutorial.

In audio deep learning, spectrograms play an important role: they connect audio files to deep learning models. Front-end tools such as librosa and nnAudio convert audio waveforms (time domain) into spectrograms (time-frequency domain), which the model can then process in much the same way as images.

In a traditional setup, we would first extract spectrograms, save them to our computer, and then load these images into the model. This is slow, requires disk space, and makes it hard to tune spectrogram features to the task at hand. nnAudio solves these issues by calculating spectrograms on-the-fly as part of the neural network.
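To sketch what this looks like in practice, here is a minimal model where the spectrogram layer is simply the first layer of the network. The layout, sample rate, Mel-bin count, and class count below are hypothetical choices for a KWS-style setup, not taken from the article:

```python
import torch
import torch.nn as nn
from nnAudio import features  # older releases expose this as nnAudio.Spectrogram

class KWSNet(nn.Module):
    # Hypothetical model for illustration: raw waveforms go in, and the
    # Mel-spectrogram front-end runs on the fly as part of the network.
    def __init__(self, sr=16000, n_mels=40, n_classes=12):
        super().__init__()
        self.frontend = features.MelSpectrogram(sr=sr, n_mels=n_mels)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(n_classes))

    def forward(self, waveform):        # waveform: (batch, num_samples)
        spec = self.frontend(waveform)  # (batch, n_mels, time_frames)
        return self.head(spec)

model = KWSNet().to("cuda" if torch.cuda.is_available() else "cpu")
```

Because the front-end sits inside the model, there is no intermediate image on disk, and changing a spectrogram parameter is just a one-line change to the layer.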

nnAudio can calculate different types of spectrograms, such as the short-time Fourier transform (STFT), the Mel-spectrogram, and the constant-Q transform (CQT), by leveraging PyTorch and GPU processing. Processing audio on the GPU shortens the computation time by up to 100x.
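A quick sketch of these three transforms (the parameter values are illustrative, not prescriptions): each one is a regular PyTorch module, so calling .to(device) moves its precomputed kernels to the GPU along with the rest of the model.

```python
import torch
from nnAudio import features

device = "cuda" if torch.cuda.is_available() else "cpu"
wave = torch.randn(1, 16000).to(device)  # one second of dummy audio at 16 kHz

stft = features.STFT(sr=16000, n_fft=512, hop_length=256,
                     output_format="Magnitude").to(device)
mel = features.MelSpectrogram(sr=16000, n_fft=512, n_mels=40,
                              hop_length=256).to(device)
cqt = features.CQT(sr=16000, hop_length=256).to(device)

print(stft(wave).shape)  # (1, freq_bins, time_frames)
print(mel(wave).shape)   # (1, n_mels, time_frames)
print(cqt(wave).shape)   # (1, n_bins, time_frames)
```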


Dorien Herremans

Associate Professor at Singapore University of Technology and Design. Lead of the Audio, Music, and AI Lab (AMAAI)