This work was presented as a poster at the New in ML workshop at the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS 2023).
Disentanglement in speech representation learning allows models to isolate and control different factors within audio data, such as speaker identity, accent, emotion, and style. By separating these factors, models can achieve more precise and flexible synthetic speech generation.
SynSpeech is a synthetic speech dataset (16 kHz sampling rate, mono channel) created for standardized evaluation of disentangled speech representation learning, with controlled variations in speaker identity, spoken content, and speaking style. It comprises 184,560 utterances generated with neural text-to-speech models, each annotated with speaker identity, text, gender, and speaking style (default, friendly, sad, whispering). By controlling individual factors, SynSpeech supports isolated evaluation of model performance on disentanglement tasks.
The datasets are hosted on Figshare and are accessible through this project link: Neural Speech Synthesis for Disentangled Representation Learning.
| Version | Speakers | Contents | Styles | Total | Download |
|---|---|---|---|---|---|
| Small | 50 | 500 | 1 | 25,000 | Download (4.87 GB) |
| Medium | 25 | 500 | 4 | 50,000 | Download (10.68 GB) |
| Large | 249 | 110 | 4 | 109,560 | Download Part 1 (12.02 GB), Download Part 2 (9.84 GB) |
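To illustrate how the per-utterance annotations support factor-wise filtering, here is a minimal loading sketch in Python. The metadata file name, column names, and path layout are assumptions for illustration, not the dataset's documented structure; adjust them to match the downloaded archive.

```python
# Hypothetical loading sketch -- file and column names are placeholders,
# not the dataset's documented layout.
import pandas as pd
import soundfile as sf

# Per-utterance annotations: speaker identity, text, gender, speaking style.
meta = pd.read_csv("SynSpeech/metadata.csv")

# Isolate a single factor, e.g. select all whispering-style utterances
# while the annotations keep speaker and content identifiable.
whispering = meta[meta["style"] == "whispering"]

for _, row in whispering.head(3).iterrows():
    audio, sr = sf.read(row["path"])  # expected: 16 kHz, mono
    print(row["speaker_id"], row["gender"], sr, f"{len(audio) / sr:.2f}s")
```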
Here we present audio samples for male and female speakers across different speaking styles.
| Speaker | Default | Friendly | Sad | Whispering |
|---|---|---|---|---|
| Male | *(audio sample)* | *(audio sample)* | *(audio sample)* | *(audio sample)* |
| Female | *(audio sample)* | *(audio sample)* | *(audio sample)* | *(audio sample)* |
This demo showcases speech representation learning using the RAVE VAE-based architecture with adversarial fine-tuning, presenting original, reconstructed, and generated samples (all audio resampled to 44.1 kHz to enable high-fidelity synthesis prior to training).
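The 44.1 kHz resampling step mentioned above can be reproduced with standard tools; here is a minimal sketch using librosa and soundfile (file paths are placeholders):

```python
# Minimal sketch of the 16 kHz -> 44.1 kHz resampling step; paths are
# placeholders for a dataset file and an output directory.
import librosa
import soundfile as sf

SRC_SR = 16_000   # SynSpeech native sampling rate
DST_SR = 44_100   # target rate for high-fidelity RAVE training

# sr=None preserves the file's native sampling rate on load.
audio, sr = librosa.load("synspeech/sample.wav", sr=None, mono=True)
assert sr == SRC_SR

# Band-limited resampling to the target rate.
audio_44k = librosa.resample(audio, orig_sr=sr, target_sr=DST_SR)
sf.write("synspeech_44k/sample.wav", audio_44k, DST_SR)
```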
Visualization of the original, reconstructed, and generated audio for comparison.
| Original | Reconstructed | Generated |
|---|---|---|
| *(sample)* | *(sample)* | *(sample)* |
We demonstrate speech representation learning, showcasing original, reconstructed, and generated samples.
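To make the three sample types concrete: a reconstruction encodes the original audio and decodes its latent, while a generated sample decodes a latent drawn from the prior. A generic PyTorch sketch follows, assuming a trained VAE exposing `encode`/`decode` methods (an illustrative interface, not the exact model used in this demo):

```python
# Generic VAE sketch of "reconstructed" vs. "generated" samples.
# `model` is assumed to be a trained VAE with encode()/decode() methods;
# this interface is illustrative, not the demo's actual architecture.
import torch

@torch.no_grad()
def reconstruct(model, audio: torch.Tensor) -> torch.Tensor:
    # Reconstruction: encode the original audio, decode the posterior mean.
    mu, logvar = model.encode(audio)
    return model.decode(mu)

@torch.no_grad()
def generate(model, latent_dim: int, n: int = 1) -> torch.Tensor:
    # Generation: decode latents sampled from the standard-normal prior.
    z = torch.randn(n, latent_dim)
    return model.decode(z)
```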
Visualization of the original, reconstructed, and generated audio for comparison.
| Original | Reconstructed | Generated |
|---|---|---|
| *(sample)* | *(sample)* | *(sample)* |
This demo examines representation learning in chimpanzee vocalizations using the same techniques as for human speech.
Visualization of the original, reconstructed, and generated audio for comparison.
| Original | Reconstructed | Generated |
|---|---|---|
| *(sample)* | *(sample)* | *(sample)* |
Further support in the development of this dataset was provided by student assistant Ronja Natascha Lindemann.
Yusuf Brima<sup>1,2,*</sup>, Ulf Krumnack<sup>1</sup>, Simone Pika<sup>2</sup>, Gunther Heidemann<sup>1</sup>
<sup>1</sup>Computer Vision, <sup>2</sup>Comparative BioCognition, <sup>*</sup>Corresponding author
Institute of Cognitive Science, Universität Osnabrück, Germany
ybrima@uos.de, krumnack@uos.de, spika@uos.de, gheidema@uos.de