Pelcra

Polish & English
Language Corpora
for Research
& Applications

Spelling and NUmbers Voice database

 

Description

SNUV (Spelling and NUmbers Voice database) is a speech database for spelling and number recognition, containing over 220 hours of recordings of Polish speakers reading numbers and spelling words. 210 different participants were paid to provide samples of their speech through an online spoken data collection platform. Written representations of the recordings are provided alongside the original sound files. The envisaged application of this resource is the creation of automatic speech recognition (ASR) tools that can recognize words and numbers spelled out by users. SNUV has been released under a CC-BY license and can be used for commercial purposes free of charge.

Download the corpus

Download sample

Download database with additional information

 

Statistics

The corpus contains recordings of 210 speakers: 90 males and 120 females, aged between 11 and 69. The total word count of the corpus is 704,625. The utterances are recorded as single-channel, 22050 Hz, 16-bit *.wav files.
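
As a quick sanity check after downloading, the stated audio format can be verified with Python's standard `wave` module. This is only a minimal sketch: the function names and the file path in the usage note are hypothetical, and the corpus's actual directory layout is not assumed here.

```python
import wave

# Format stated in the Statistics section: single channel, 22050 Hz, 16 bit.
EXPECTED_CHANNELS = 1
EXPECTED_RATE_HZ = 22050
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def describe_wav(path):
    """Return (channels, sample_rate_hz, sample_width_bytes, duration_s)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    return channels, rate, width, duration


def matches_snuv_format(path):
    """True if the file is mono, 22050 Hz, 16-bit, as the corpus page states."""
    channels, rate, width, _ = describe_wav(path)
    return (channels, rate, width) == (
        EXPECTED_CHANNELS,
        EXPECTED_RATE_HZ,
        EXPECTED_SAMPLE_WIDTH_BYTES,
    )
```

For example, `matches_snuv_format("recording.wav")` (a placeholder file name) would flag files that were resampled or converted to stereo during processing.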

 

Creation

The corpus was created using a crowdsourcing approach. An external company, VoiceLab, was commissioned to create a Web-based application for recording voice data and to find participants. Each participant was paid according to the total length of the recordings they provided, up to 100 PLN (25 EUR). VoiceLab also processed the recordings to ensure appropriate quality.