Polish & English Language Corpora for Research & Applications

Tools and Resources

This page lists some of the most important language technology tools and resources provided by the PELCRA team.

Corpus search engines for Polish and English corpora

The PELCRA Search Engine for the National Corpus of Polish was developed as part of the National Corpus of Polish Project. You can access the original search engine here.
A new version of the engine which supports morphosyntactic features is available here.
There is also an instance of this new search engine for the BNC corpus available here.
We also recommend Monco as a monitor corpus of English and Polish.

HASK

HASK collocation dictionaries are dictionaries of frequent word combinations automatically generated from reference English and Polish language corpora.

Developed by the PELCRA group at the University of Łódź, HASK dictionaries are essentially phraseological databases intended for linguists, language teachers, lexicographers, developers of language materials, translators and other language professionals, as well as casual dictionary users.

HASK for BNC is available here.
HASK for NKJP is available here.

Parallel corpora

Paralela is a corpus search engine developed in the CLARIN-PL project for a large collection of annotated Polish-English parallel texts. Most of the texts indexed in Paralela can be extracted programmatically through its API.

You can access Paralela here.

Its REST API is briefly described here.

Spokes

Corpora of spontaneous conversational speech are an important source of primary data, not only for linguists, but increasingly also for researchers in other areas of the social sciences and humanities. Spokes is a multimedia search engine for a unique corpus of casual conversational Polish. It is currently being developed by the PELCRA team as part of the Polish CLARIN Infrastructure. By providing tools for data mining and visualization, we hope to make this data more easily accessible to researchers interested in exploring samples of naturally occurring conversational language.

Spokes for Polish conversational data (ca. 2.3 million words) is available here.
A separate instance of Spokes is available for the BNC Data at this address.

The full contents of Spokes can be extracted programmatically using a dedicated web service. For details, see the documentation available on the help page of Spokes.

Offline Corpora

Several more recent corpora, listed below, can be downloaded together with their recordings.

The following paper should be cited to fulfill the CC attribution condition of the license for these resources:

Pęzik, Piotr. “Increasing the Accessibility of Time-Aligned Speech Corpora with Spokes Mix,” 4297–4300. Miyazaki, Japan, 2018. http://www.lrec-conf.org/proceedings/lrec2018/pdf/888.pdf.

Detailed documentation of PELCRA Spoken Offline Corpora and download links can be found here.

PELCRA_EMO	A corpus of focused interviews (people reflecting upon their emotions).
PELCRA_LUZ	A corpus of open interviews.
PELCRA_PARL	Samples of spoken parliamentary data.
PELCRA_EMI	A corpus of Polish emigrants to Scotland.
PELCRA_YT	Samples of Polish YouTubers' videos.
MOWA_MIAST	A corpus of Polish conversations recorded in the 1980s.

DiaBiz

DiaBiz is a corpus of over 3,700 phone-based customer support line dialogs. More information is available here.

PLEC

The Polish-English Learner Corpus (PLEC) was compiled to investigate selected aspects of the English used by Polish speakers by applying a corpus linguistics methodology, and to disseminate the results among teachers, authors of handbooks and in academia. The corpus contains 3 million words of written and spoken (200,000 words) Polish learner English. It has been annotated with selected errors and linguistic phenomena typical of Polish speakers. The contents of the corpus can be explored at http://pelcra.pl/plec/.

SNUV

SNUV is a freely available Polish speech database. It contains approximately 220 hours of recordings of people reading numbers and spelling words. More information about this database is available here.

PoS Tagger for Polish

A part-of-speech tagger for Polish trained on the PolEval datasets (an elaborated version of the NKJP 1M corpus) is available at this address: http://clarin.pelcra.pl/tagger-app/.

It can be queried through REST requests as shown in the following example:

http://clarin.pelcra.pl/apt_pl/?sentences=["Ala lubi kota.","Jurek ma worek."]
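
The request above can also be built programmatically. The following is a minimal Python sketch, not an official client. It assumes only what the example URL shows: a GET request with a single `sentences` parameter holding a JSON array of strings. The function names `build_tagger_url` and `tag_sentences` are hypothetical, and because the response format is not documented on this page, the response body is returned as raw text.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example request on this page.
TAGGER_ENDPOINT = "http://clarin.pelcra.pl/apt_pl/"


def build_tagger_url(sentences):
    """Build the GET URL for a list of sentences (hypothetical helper).

    The sentences are serialized as a compact JSON array and percent-encoded,
    so that quotes, brackets and Polish diacritics survive in the query string.
    """
    payload = json.dumps(sentences, ensure_ascii=False, separators=(",", ":"))
    return TAGGER_ENDPOINT + "?" + urlencode({"sentences": payload})


def tag_sentences(sentences, timeout=30):
    """Send the request and return the raw response body as text.

    Assumption: the service answers with UTF-8 text. The structure of the
    response is not described here, so no parsing is attempted.
    """
    with urlopen(build_tagger_url(sentences), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call tag_sentences(...) to actually query the service.
    print(build_tagger_url(["Ala lubi kota.", "Jurek ma worek."]))
```

Percent-encoding the parameter is a precaution: browsers usually encode the brackets and quotation marks of the example URL automatically, while most HTTP libraries expect an already encoded query string.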

WiKNN Classifier

The Wikipedia K-nearest neighbors (WiKNN) classifier is available here as a demo application and a web service. It assigns labels from the Wikipedia taxonomy of article topics to Polish or English texts submitted to it. For example, given the text of the NYT article "Patriots Mount a Comeback for the Ages to Win a Fifth Super Bowl", it produces a set of Wikipedia labels such as New_England_Patriots_seasons, Super_Bowl or New_England_Patriots, with their respective relevance scores. These labels can then be located in the Wikipedia taxonomy of categories.
