Learning Disentangled Speech Representations

This work was presented as a poster at the New in ML workshop at the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS 2023).

Why Disentanglement Matters

Disentanglement in speech representation learning allows models to isolate and control different factors within audio data, such as speaker identity, accent, emotion, and style. By separating these factors, models can achieve more precise and flexible synthetic speech generation, and each factor can be manipulated or evaluated in isolation.


SynSpeech Dataset Versions

SynSpeech is a synthetic speech dataset (16 kHz, mono) created for standardized evaluation of disentangled speech representation learning, with controlled variations in speaker identity, spoken content, and speaking style. It includes 184,560 utterances in total, generated with neural text-to-speech models and annotated with speaker identity, text, gender, and speaking style (default, friendly, sad, whispering). By controlling individual factors, SynSpeech supports isolated evaluation of model performance on disentanglement tasks.

The datasets are hosted on Figshare and are accessible through this project link: Neural Speech Synthesis for Disentangled Representation Learning.
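For experiments, an utterance and its factor labels can be loaded in a few lines. Below is a minimal sketch that reads audio with torchaudio and assumes the annotations ship as a per-split metadata table; the directory layout, file names, and column names are illustrative assumptions, not the exact release format.

```python
# Minimal sketch: iterate over SynSpeech utterances with their factor labels.
# DATA_ROOT, metadata.csv, and the column names are assumptions for illustration;
# adjust them to match the layout of the downloaded release.
from pathlib import Path

import pandas as pd
import torchaudio

DATA_ROOT = Path("SynSpeech-small")             # hypothetical extraction directory
meta = pd.read_csv(DATA_ROOT / "metadata.csv")  # assumed columns: file, speaker_id, gender, style, text

for row in meta.itertuples(index=False):
    waveform, sample_rate = torchaudio.load(str(DATA_ROOT / row.file))  # (channels, samples)
    assert sample_rate == 16_000 and waveform.shape[0] == 1             # 16 kHz, mono
    factors = {"speaker": row.speaker_id, "gender": row.gender,
               "style": row.style, "text": row.text}
    # ... pass (waveform, factors) to a training or probing pipeline
    break  # remove to iterate over the full split
```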

Dataset Versions and Downloads

Version   Speakers   Contents   Styles   Total Utterances   Download
Small     50         500        1        25,000             Download (4.87 GB)
Medium    25         500        4        50,000             Download (10.68 GB)
Large     249        110        4        109,560            Download Part 1 (12.02 GB); Download Part 2 (9.84 GB)
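Each version is a full factorial combination of its factors, so the totals follow directly from the table; for example, the Large version covers 249 speakers × 110 contents × 4 styles = 109,560 utterances.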

Generated Speech Samples

Here we present audio samples for male and female speakers across different speaking styles.

[Audio players: Male and Female speakers × Default, Friendly, Sad, and Whispering styles]

Speech Representation Learning with the Small Dataset

This section demonstrates speech representation learning using the RAVE VAE-based architecture with adversarial fine-tuning, showcasing original, reconstructed, and generated samples. All audio is resampled to 44.1 kHz prior to training to enable high-fidelity synthesis.
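Since SynSpeech ships at 16 kHz and the model here operates on 44.1 kHz audio, each utterance is upsampled before training. The sketch below shows that resampling step and a round trip through an exported RAVE model; the model path is a placeholder, and the encode/decode interface assumes a TorchScript export as produced by the RAVE tooling rather than the exact pipeline used for these samples.

```python
# Sketch: upsample a 16 kHz SynSpeech utterance to 44.1 kHz and run it through
# an exported RAVE model. Paths are placeholders; the encode/decode interface
# assumes a TorchScript export of RAVE and is illustrative only.
import torch
import torchaudio

waveform, sr = torchaudio.load("sample_16k.wav")                 # (1, samples) at 16 kHz
resample = torchaudio.transforms.Resample(orig_freq=sr, new_freq=44_100)
audio_44k = resample(waveform)                                   # (1, samples') at 44.1 kHz

model = torch.jit.load("rave_synspeech.ts").eval()               # hypothetical exported model
with torch.no_grad():
    x = audio_44k.unsqueeze(0)                                   # (batch, channels, samples)
    z = model.encode(x)                                          # latent code
    reconstruction = model.decode(z)                             # reconstructed waveform
    generated = model.decode(torch.randn_like(z))                # decode a sample from the prior

torchaudio.save("reconstructed.wav", reconstruction.squeeze(0), 44_100)
torchaudio.save("generated.wav", generated.squeeze(0), 44_100)
```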

Spectrogram Comparison

Visualization of the original, reconstructed, and generated audio for comparison.

Figure 1: Spectrogram Comparison: Original, Reconstructed, and Generated. This figure illustrates the waveform and spectrogram views for the original, reconstructed, and generated audio samples, providing a visual comparison across different processing stages in speech synthesis.
[Audio players: Original, Reconstructed, Generated]
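A comparison figure like the one above can be reproduced by plotting the three signals side by side. The following is a minimal sketch using librosa and matplotlib; the filenames are placeholders, and this is an illustrative recipe rather than the exact script used to generate the figures.

```python
# Sketch: plot mel-spectrograms of the original, reconstructed, and generated
# audio side by side. Filenames are placeholders for illustration.
import librosa
import librosa.display
import matplotlib.pyplot as plt

files = {"Original": "original.wav",
         "Reconstructed": "reconstructed.wav",
         "Generated": "generated.wav"}

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, (title, path) in zip(axes, files.items()):
    y, sr = librosa.load(path, sr=None)                          # keep the native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=1.0)                   # log-magnitude (dB) scale
    img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(title)

fig.colorbar(img, ax=axes, format="%+2.0f dB")
plt.savefig("spectrogram_comparison.png", dpi=150, bbox_inches="tight")
```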

Speech Representation Learning with the Medium Dataset

This section demonstrates speech representation learning on the medium dataset, again showcasing original, reconstructed, and generated samples.

Spectrogram Comparison

Visualization of the original, reconstructed, and generated audio for comparison.

Figure 2: Spectrogram Comparison for the Medium-Size Dataset: Original, Reconstructed, and Generated. This figure displays the waveform and spectrogram views for the original, reconstructed, and generated audio samples within the medium-size SynSpeech dataset, offering a visual comparison to analyze performance across different synthesis stages.
[Audio players: Original, Reconstructed, Generated]

Speech Representation Learning on the Chimpanzee Pant-Hoot Vocalization Dataset

This section examines representation learning on chimpanzee pant-hoot vocalizations, applying the same techniques used for human speech.

Spectrogram Comparison

Visualization of the original, reconstructed, and generated audio for comparison.

Figure 3: Spectrogram Comparison for Chimpanzee Pant-Hoot Vocalization: Original, Reconstructed, and Generated. This figure demonstrates the waveform and spectrogram views for chimpanzee pant-hoot vocalizations, comparing the original, reconstructed, and generated audio samples. It highlights the model’s ability to capture and synthesize the complex vocal characteristics of chimpanzee calls.
[Audio players: Original, Reconstructed, Generated]

Acknowledgments

Further support in the development of this dataset was provided by student assistant Ronja Natascha Lindemann.


Author Affiliations

Yusuf Brima1,2,*, Ulf Krumnack1, Simone Pika2, Gunther Heidemann1

1Computer Vision, 2Comparative BioCognition, *Corresponding author

Institute of Cognitive Science, Universität Osnabrück, Germany

ybrima@uos.de, krumnack@uos.de, spika@uos.de, gheidema@uos.de

Link to the paper