Pelcra

Polish & English Language Corpora for Research & Applications

PELCRA Word Aligned Corpora

Word-aligned corpora are an important resource in multilingual language processing. However, few such resources are available for Polish, and their creation is impeded by the high processing power and time requirements involved. The PELCRA team undertook to create a set of word-aligned corpora as part of the META-NET/CESAR project, using the existing PELCRA parallel corpora and statistical word-alignment software (GIZA). The corpora will be made available in an XML format designed to comply with the TEI P5 guidelines.
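To illustrate what a word alignment records, here is a minimal sketch in Python. The sentence pair, the link string and the helper names `parse_links` and `aligned_words` are all hypothetical and are not taken from the PELCRA corpora. The "i-j" index-pair notation is a plain-text convention often used by alignment tools; it is shown here only for illustration and is not the XML format of the delivered resources.

```python
# Hypothetical sentence pair; not taken from the PELCRA corpora.
source = "The Commission adopted the regulation".split()
target = "Komisja przyjęła rozporządzenie".split()

# Alignment links as "source_index-target_index" pairs (0-based).
# This plain-text notation is an assumption made for illustration;
# it is not the XML format used by the PELCRA resources.
links_text = "0-0 1-0 2-1 3-2 4-2"


def parse_links(text):
    """Turn 'i-j' tokens into a list of (source_index, target_index) tuples."""
    pairs = []
    for token in text.split():
        i, j = token.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(source_tokens, target_tokens, links):
    """Map each link to the pair of words it connects."""
    return [(source_tokens[i], target_tokens[j]) for i, j in links]


links = parse_links(links_text)
pairs = aligned_words(source, target, links)
for src_word, tgt_word in pairs:
    print(f"{src_word} -> {tgt_word}")
```

In this sketch several source words may link to the same target word (both "The" and "Commission" point to "Komisja"), which is typical when one language uses articles and the other does not.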

Download.

Download database dump.

Like its source, the PELCRA parallel corpus collection, the PELCRA word-aligned corpus collection is split into four parts, which differ in license, linguality, and alignment procedure.

Corpora | Linguality | License | Alignment level | Alignment type
CORDIS, RAPID, JRC-Acquis | English-Polish | CC-BY | word, sentence | statistical
Academia | Polish-English | CC-BY-NC | word, sentence | statistical
CORDIS, ESO, EuroParl, RAPID | Polish-English | CC-BY | word, sentence | statistical
OSW | Polish-English | CC-BY-NC | word, sentence | statistical

Detailed statistics for the delivered resources:

Corpus | Source language | Target language | Texts | Source words | Target words | Alignment level | Alignment type
CORDIS | English | Polish | 10 268 | 3 540 445 | 3 254 917 | word, sentence | statistical
JRC-Acquis | English | Polish | 23 319 | 32 447 298 | 28 571 342 | word, sentence | statistical
RAPID | English | Polish | 4 740 | 4 149 470 | 3 793 66 | word, sentence | statistical