Pelcra

Polish & English Language Corpora for Research & Applications

Paralela

Paralela is both a corpus of annotated Polish-English parallel texts and a search engine developed for this collection in the CLARIN-PL project.

Please cite the following paper if you use Paralela for your research:

Pęzik, Piotr. "Exploring Phraseological Equivalence with Paralela." In Polish-Language Parallel Corpora, edited by Ewa Gruszczyńska and Agnieszka Leńko-Szymańska, 67–81. Warsaw: Instytut Lingwistyki Stosowanej UW, 2016.

The web interface of Paralela can be accessed here:

http://paralela.clarin-pl.eu/

The entire Polish-English content of Paralela can be extracted using the following two REST endpoints:

  1. For a list of sources and their sizes (in segments) use this method:
  2. For a list of segments matching a metadata query, specifying criteria such as a certain source name, use the following method:

    The parameters of this request are largely self-explanatory: it retrieves the first 25 segments of "Ogniem i Mieczem". A single query can currently return at most 50,000 segments, so you may need to page through the results for large sources. Note that this method supports the DisMax query syntax. For example, to get a subset of merged segments from "Ogniem i Mieczem", you might formulate the following query:

  3. This Colab script can be used to download the entire contents of Paralela using the endpoints described above.
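
The paging mentioned above can be sketched in Python as follows. This is a minimal illustration only, not the Colab script itself: `fetch_page(offset, limit)` is a hypothetical callable that you would implement yourself around the segment endpoint (its actual URL and parameter names are not reproduced here), and `total` is assumed to be the source size in segments, as reported by the source-list endpoint. The only value taken from the text is the 50,000-segment limit per query.

```python
from typing import Callable, Dict, Iterator, List, Tuple

MAX_PAGE_SIZE = 50_000  # per-query limit stated in the text above

# Hypothetical signature: fetch_page(offset, limit) -> list of segment records.
# How offset and limit map onto real request parameters is an assumption.
FetchPage = Callable[[int, int], List[Dict]]


def page_ranges(total: int, page_size: int = MAX_PAGE_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that together cover `total` segments."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    for offset in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield offset, min(page_size, total - offset)


def iter_segments(fetch_page: FetchPage, total: int,
                  page_size: int = MAX_PAGE_SIZE) -> Iterator[Dict]:
    """Yield every segment of one source, requesting one page at a time."""
    for offset, limit in page_ranges(total, page_size):
        page = fetch_page(offset, limit)
        if not page:
            # Stop early if the server returns fewer segments than expected.
            break
        yield from page
```

Keeping the HTTP call behind `fetch_page` means the paging logic can be checked against an in-memory list before it is pointed at the live service.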