
Researchers collect 950,000 hours of open source speech data for EU languages




Summary

International researchers have compiled MOSEL, a comprehensive open-source speech data collection for the 24 official EU languages. The project aims to advance the development of open AI language models in Europe.

Creating powerful AI language models requires vast amounts of training data. Until now, English-language datasets and proprietary systems from large tech companies have dominated the field. An international research team wants to change this: with MOSEL (Massive Open-source compliant Speech data for European Languages), they have assembled an extensive collection of open-source speech data for the 24 official languages of the European Union.

The collected data comes from 18 different sources, including projects like CommonVoice, LibriSpeech, and VoxPopuli. It includes both transcribed speech recordings and unlabeled audio data. Particularly valuable are the 505,000 hours of transcribed data.

However, the distribution among languages is very uneven. While over 437,000 hours of labeled data are available for English, languages like Maltese or Irish have only a few hours.


AI-supported transcription expands database

To improve the data situation for resource-poor languages, the researchers automatically transcribed an additional 441,000 hours of previously unlabeled audio data. They used OpenAI’s Whisper speech recognition model for this purpose.

The team explains that while automatic transcription is not perfect, it allows large amounts of training material to be provided even for languages with little manually transcribed data. The generated transcripts are published under the Creative Commons CC-BY license, which allows free use with attribution.

The challenges of automatic transcription are particularly evident in the case of Maltese. Here, the Whisper model had a word error rate of over 80 percent, meaning that the word-level errors (substituted, missing, and wrongly inserted words) added up to more than four-fifths of the words actually spoken.
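
For readers unfamiliar with the metric: word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between an automatic transcript and a reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition. The function name and the example sentences are invented for illustration, and this is not the evaluation code used by the MOSEL team; real evaluations usually also normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: four of five reference words are substituted.
reference = "one two three four five"
hypothesis = "one too tree for fife"
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors too, the rate can exceed 100 percent when a transcript contains many extra words, so it is not strictly a share of misrecognized words.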

For such languages, much work is still needed – but the automated transcriptions could serve as a starting point for further improvements. The team also plans to collect more data for underrepresented languages.

The entire data collection is freely available on GitHub and is intended to facilitate researchers’ and developers’ access to extensive speech data for European languages.
