Polish & English
Language Corpora
for Research
& Applications

PELCRA Word Aligned Corpora

Word-aligned corpora are an important resource in multilingual language processing. However, few such resources are available for Polish, and their creation is hampered by the substantial processing power and time required. As part of the META-NET/CESAR project, the PELCRA team set out to create a set of word-aligned corpora from the existing PELCRA parallel corpora using statistical word-alignment software (GIZA). The corpora will be made available in an XML format designed to comply with the TEI P5 standard.

Download.

Download database dump.

Like the PELCRA parallel corpus collection from which it is derived, the PELCRA word-aligned corpus collection is split into four parts, which differ in license, linguality and alignment procedure.

Corpora	Linguality	License	Alignment level	Alignment type
CORDIS, RAPID, JRC-Acquis	English-Polish	CC-BY	word, sentence	statistical
Academia	Polish-English	CC-BY-NC	word, sentence	statistical
CORDIS, ESO, EuroParl, RAPID	Polish-English	CC-BY	word, sentence	statistical
OSW	Polish-English	CC-BY-NC	word, sentence	statistical

Detailed statistics for the delivered resources:

Corpus	Source language	Target language	Texts	Source words	Target words	Alignment level	Alignment type
CORDIS	English	Polish	10 268	3 540 445	3 254 917	word, sentence	statistical
JRC-Acquis	English	Polish	23 319	32 447 298	28 571 342	word, sentence	statistical
RAPID	English	Polish	4 740	4 149 470	3 793 66	word, sentence	statistical
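
As a rough illustration of how the word counts in the table above can be compared, the following minimal sketch computes the number of Polish (target) words per English (source) word. It uses only the CORDIS and JRC-Acquis figures as examples; the function names `parse_count` and `target_source_ratio` are hypothetical helpers introduced here and are not part of any PELCRA tooling.

```python
# Figures copied from the CORDIS and JRC-Acquis rows of the statistics table.
# The table groups digits with spaces, so whitespace is removed before parsing.
ROWS = [
    ("CORDIS", "3 540 445", "3 254 917"),
    ("JRC-Acquis", "32 447 298", "28 571 342"),
]


def parse_count(text: str) -> int:
    """Convert a space-grouped count such as '3 540 445' to an int."""
    # str.split() with no arguments also handles non-breaking spaces.
    return int("".join(text.split()))


def target_source_ratio(source_words: str, target_words: str) -> float:
    """Return the number of target words per source word."""
    return parse_count(target_words) / parse_count(source_words)


if __name__ == "__main__":
    for name, source_words, target_words in ROWS:
        ratio = target_source_ratio(source_words, target_words)
        print(f"{name}: {ratio:.3f} Polish words per English word")
```

In both example rows the ratio is below 1, meaning the Polish side contains fewer words than the English side.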
