This page lists some of the most important language technology tools and resources provided by the PELCRA team.
HASK collocation dictionaries are dictionaries of frequent word combinations automatically generated from reference English and Polish language corpora.
Developed by the PELCRA group at the University of Łódź, HASK dictionaries are essentially phraseological databases meant to be used by linguists, language teachers, lexicographers, language materials developers, translators and other language professionals and casual dictionary users.
Paralela is a corpus search engine developed in the CLARIN-PL project for a large collection of annotated Polish-English parallel texts. Most of the texts indexed in Paralela can be extracted programmatically through its API.
You can access Paralela here.
Corpora of spontaneous conversational speech are an important source of primary data, not only for linguists, but increasingly also for research in other areas of the social sciences and humanities. Spokes is a multimedia search engine for a unique corpus of casual conversational Polish. It is currently being developed by the PELCRA team as part of the Polish CLARIN Infrastructure. By providing tools for data mining and visualization, we hope to make this data more easily accessible to researchers interested in exploring samples of naturally occurring conversational language.
The full contents of Spokes can be extracted programmatically using a dedicated web service. For details, see the documentation available on the help page of Spokes.
The Polish-English Learner Corpus was compiled to investigate selected aspects of the English used by Polish speakers by applying corpus linguistics methodology, and to disseminate the results among teachers, authors of handbooks and in academia. The corpus contains 3 million words of written and spoken (200,000 words) Polish learner English. It has been annotated with selected errors and linguistic phenomena typical of Polish speakers. The contents of the corpus can be explored at http://pelcra.pl/plec/.
SNUV is a freely available Polish speech database. It contains approximately 220 hours of recordings of people reading numbers and spelling words. More information about this database is available here.
We have made available a part-of-speech tagger trained on the NKJP 1M corpus and using the Morfeusz morphological dictionary. It can be tested through this online interface or accessed programmatically as a web service.
Here is an example curl call to the tagger:
curl --data "text=Ala ma kota.&tagger=openNLP&tagset=standard&format=JSON&lang=pl" \
     http://clarin.pelcra.pl/tools/api/tagger/tag
The JSON response to this request looks like this.
You can also use this URL to call our tagger as a proper REST service. By changing the lang param to "en" you get access to an online instance of the Apache OpenNLP tagger for English.
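The same request can also be made from a script. The following is a minimal Python sketch, assuming the service accepts the form-encoded POST parameters used in the curl example; the function names build_tagger_request and tag_text are illustrative and not part of the service itself. The response is returned as raw text because its exact JSON structure is not described on this page.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint and parameter names are copied from the curl example on this page.
TAGGER_URL = "http://clarin.pelcra.pl/tools/api/tagger/tag"


def build_tagger_request(text, lang="pl"):
    """Build a POST request mirroring the curl call, without sending it.

    The function name is hypothetical; only the URL and the parameter
    names and values come from the documented example.
    """
    params = {
        "text": text,
        "tagger": "openNLP",
        "tagset": "standard",
        "format": "JSON",
        "lang": lang,  # "pl" for Polish, "en" for the English OpenNLP tagger
    }
    body = urlencode(params).encode("utf-8")
    return Request(TAGGER_URL, data=body, method="POST")


def tag_text(text, lang="pl", timeout=30):
    """Send the request and return the raw response body as a string."""
    request = build_tagger_request(text, lang)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage (requires network access and a running service):
#   print(tag_text("Ala ma kota."))
#   print(tag_text("Alice has a cat.", lang="en"))
```

Keeping request construction separate from sending makes it possible to inspect the encoded parameters before any network call is made.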
The Wikipedia K-nearest neighbors classifier is available here as a demo application and a web service. It assigns labels from the Wikipedia taxonomy of article topics to Polish or English texts submitted to it. For example, given the text of this NYT article, "Patriots Mount a Comeback for the Ages to Win a Fifth Super Bowl", it will produce a set of Wikipedia labels such as New_England_Patriots_seasons, Super_Bowl or New_England_Patriots, together with their respective relevance scores. These labels can then be located in the Wikipedia taxonomy of categories.