Learning Disentangled Speech Representations

This work was presented as a poster at the New in ML workshop at the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS 2023).

Why Disentanglement Matters

Disentanglement in speech representation learning allows models to isolate and control different factors within audio data, such as speaker identity, accent, emotion, and style. By separating these factors, models can achieve more precise and flexible synthetic speech generation, and each factor can be manipulated or evaluated in isolation.


SynSpeech Dataset Versions

SynSpeech is a synthetic speech dataset (16 kHz, mono) created for standardized evaluation of disentangled speech representation learning, with controlled variations in speaker identity, spoken content, and speaking style. It includes 184,560 utterances in total, generated with neural text-to-speech models and annotated with speaker identity, text, gender, and speaking style (default, friendly, sad, whispering). By controlling individual factors, SynSpeech supports isolated evaluation of model performance on disentanglement tasks.

The datasets are hosted on Figshare and are accessible through this project link: Neural Speech Synthesis for Disentangled Representation Learning.
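For experiments, an utterance and its factor labels can be loaded in a few lines. Below is a minimal sketch that reads audio with torchaudio and assumes the annotations ship as a per-split metadata table; the directory layout, file names, and column names are illustrative assumptions, not the exact release format.

```python
# Minimal sketch: iterate over SynSpeech utterances with their factor labels.
# DATA_ROOT, metadata.csv, and the column names are assumptions for illustration;
# adjust them to match the layout of the downloaded release.
from pathlib import Path

import pandas as pd
import torchaudio

DATA_ROOT = Path("SynSpeech-small")             # hypothetical extraction directory
meta = pd.read_csv(DATA_ROOT / "metadata.csv")  # assumed columns: file, speaker_id, gender, style, text

for row in meta.itertuples(index=False):
    waveform, sample_rate = torchaudio.load(str(DATA_ROOT / row.file))  # (channels, samples)
    assert sample_rate == 16_000 and waveform.shape[0] == 1             # 16 kHz, mono
    factors = {"speaker": row.speaker_id, "gender": row.gender,
               "style": row.style, "text": row.text}
    # ... pass (waveform, factors) to a training or probing pipeline
    break  # remove to iterate over the full split
```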

Dataset Versions and Downloads

Version   Speakers   Contents   Styles   Total Utterances   Download
Small     50         500        1        25,000             Download (4.87 GB)
Medium    25         500        4        50,000             Download (10.68 GB)
Large     249        110        4        109,560            Download Part 1 (12.02 GB); Download Part 2 (9.84 GB)
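Each version is a full factorial combination of its factors, so the totals follow directly from the table; for example, the Large version covers 249 speakers × 110 contents × 4 styles = 109,560 utterances.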

Generated Speech Samples

Here we present audio samples for male and female speakers across different speaking styles.

[Audio players: Male and Female speakers × Default, Friendly, Sad, and Whispering styles]

Speech Representation Learning with the Small Dataset

This section demonstrates speech representation learning using the RAVE VAE-based architecture with adversarial fine-tuning, showcasing original, reconstructed, and generated samples. All audio is resampled to 44.1 kHz prior to training to enable high-fidelity synthesis.
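Since SynSpeech ships at 16 kHz and the model here operates on 44.1 kHz audio, each utterance is upsampled before training. The sketch below shows that resampling step and a round trip through an exported RAVE model; the model path is a placeholder, and the encode/decode interface assumes a TorchScript export as produced by the RAVE tooling rather than the exact pipeline used for these samples.

```python
# Sketch: upsample a 16 kHz SynSpeech utterance to 44.1 kHz and run it through
# an exported RAVE model. Paths are placeholders; the encode/decode interface
# assumes a TorchScript export of RAVE and is illustrative only.
import torch
import torchaudio

waveform, sr = torchaudio.load("sample_16k.wav")                 # (1, samples) at 16 kHz
resample = torchaudio.transforms.Resample(orig_freq=sr, new_freq=44_100)
audio_44k = resample(waveform)                                   # (1, samples') at 44.1 kHz

model = torch.jit.load("rave_synspeech.ts").eval()               # hypothetical exported model
with torch.no_grad():
    x = audio_44k.unsqueeze(0)                                   # (batch, channels, samples)
    z = model.encode(x)                                          # latent code
    reconstruction = model.decode(z)                             # reconstructed waveform
    generated = model.decode(torch.randn_like(z))                # decode a sample from the prior

torchaudio.save("reconstructed.wav", reconstruction.squeeze(0), 44_100)
torchaudio.save("generated.wav", generated.squeeze(0), 44_100)
```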

Spectrogram Comparison

Visualization of the original, reconstructed, and generated audio for comparison.

Figure 1: Spectrogram Comparison: Original, Reconstructed, and Generated. This figure illustrates the waveform and spectrogram views for the original, reconstructed, and generated audio samples, providing a visual comparison across different processing stages in speech synthesis.
[Audio players: Original, Reconstructed, Generated]
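A comparison figure like the one above can be reproduced by plotting the three signals side by side. The following is a minimal sketch using librosa and matplotlib; the filenames are placeholders, and this is an illustrative recipe rather than the exact script used to generate the figures.

```python
# Sketch: plot mel-spectrograms of the original, reconstructed, and generated
# audio side by side. Filenames are placeholders for illustration.
import librosa
import librosa.display
import matplotlib.pyplot as plt

files = {"Original": "original.wav",
         "Reconstructed": "reconstructed.wav",
         "Generated": "generated.wav"}

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, (title, path) in zip(axes, files.items()):
    y, sr = librosa.load(path, sr=None)                          # keep the native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=1.0)                   # log-magnitude (dB) scale
    img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(title)

fig.colorbar(img, ax=axes, format="%+2.0f dB")
plt.savefig("spectrogram_comparison.png", dpi=150, bbox_inches="tight")
```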

Speech Representation Learning with the Medium Dataset

This section demonstrates speech representation learning on the medium dataset, again showcasing original, reconstructed, and generated samples.

Spectrogram Comparison

Visualization of the original, reconstructed, and generated audio for comparison.

Figure 2: Spectrogram Comparison for the Medium-Size Dataset: Original, Reconstructed, and Generated. This figure displays the waveform and spectrogram views for the original, reconstructed, and generated audio samples within the medium-size SynSpeech dataset, offering a visual comparison to analyze performance across different synthesis stages.
[Audio players: Original, Reconstructed, Generated]

Speech Representation Learning on the Chimpanzee Pant-Hoot Vocalization Dataset

This section examines representation learning on chimpanzee pant-hoot vocalizations, applying the same techniques used for human speech.

Spectrogram Comparison

Visualization of the original, reconstructed, and generated audio for comparison.

Figure 3: Spectrogram Comparison for Chimpanzee Pant-Hoot Vocalization: Original, Reconstructed, and Generated. This figure demonstrates the waveform and spectrogram views for chimpanzee pant-hoot vocalizations, comparing the original, reconstructed, and generated audio samples. It highlights the model’s ability to capture and synthesize the complex vocal characteristics of chimpanzee calls.
[Audio players: Original, Reconstructed, Generated]

Acknowledgments

Further support in the development of this dataset was provided by student assistant Ronja Natascha Lindemann.


Author Affiliations

Yusuf Brima1,2,*, Ulf Krumnack1, Simone Pika2, Gunther Heidemann1

1Computer Vision, 2Comparative BioCognition, *Corresponding author

Institute of Cognitive Science, Universität Osnabrück, Germany

ybrima@uos.de, krumnack@uos.de, spika@uos.de, gheidema@uos.de

Link to the paper