Word-aligned corpora are an important resource in multilingual language processing. However, few such resources are available for Polish, and creating them is impeded by the substantial processing power and time required. The PELCRA team undertook to create a set of word-aligned corpora as part of the META-NET/CESAR project, using the existing PELCRA parallel corpora and statistical word-alignment software (GIZA). The corpora will be made available in an XML format designed to comply with the TEI P5 standard.
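As a rough illustration of what a statistical word aligner produces, the sketch below parses one alignment line into pairs of word positions. It assumes the `A3`-style line layout written by GIZA++ (the text above only says GIZA); the function name `parse_a3_line` and the example line are made up for illustration, and this is not the XML format of the delivered corpora.

```python
import re

# One entry per source token: the word, then the 1-based positions of the
# target words aligned to it, e.g. "house ({ 2 3 })".
ENTRY_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_a3_line(line):
    """Return sorted (source_pos, target_pos) pairs, both 1-based.

    A leading artificial NULL entry, if present, is dropped, so its
    links (target words aligned to nothing) are not reported.
    """
    entries = ENTRY_RE.findall(line)
    if entries and entries[0][0] == "NULL":
        entries = entries[1:]

    links = []
    for source_pos, (_word, targets) in enumerate(entries, start=1):
        for target_pos in targets.split():
            links.append((source_pos, int(target_pos)))
    return sorted(links)


# Hypothetical example: a two-word source sentence aligned to three target words.
example = "NULL ({ }) the ({ 1 }) house ({ 2 3 })"
print(parse_a3_line(example))
```

Each pair links one source word to one target word, so a source word aligned to several target words simply appears in several pairs.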
Like its source, the PELCRA parallel corpus collection, the PELCRA word-aligned corpus collection is split into four parts, which differ in license, linguality and alignment procedure.
| Corpora | Linguality | License | Alignment level | Alignment type |
|---|---|---|---|---|
| CORDIS, RAPID, JRC-Acquis | English-Polish | CC-BY | word, sentence | statistical |
| Academia | Polish-English | CC-BY-NC | word, sentence | statistical |
| CORDIS, ESO, EuroParl, RAPID | Polish-English | CC-BY | word, sentence | statistical |
| OSW | Polish-English | CC-BY-NC | word, sentence | statistical |
Detailed statistics for the delivered resources:
| Corpus | Source language | Target language | Texts | Source words | Target words | Alignment level | Alignment type |
|---|---|---|---|---|---|---|---|
| CORDIS | English | Polish | 10 268 | 3 540 445 | 3 254 917 | word, sentence | statistical |
| JRC-Acquis | English | Polish | 23 319 | 32 447 298 | 28 571 342 | word, sentence | statistical |
| RAPID | English | Polish | 4 740 | 4 149 470 | 3 793 66 | word, sentence | statistical |
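
The word counts above can be used to gauge how the Polish side compares in length with the English side. The following is a minimal sketch that computes the target-to-source word ratio for the rows whose counts are fully legible; RAPID is left out because its target word count appears truncated in the table. The names `COUNTS` and `target_to_source_ratio` are illustrative only.

```python
# Word counts copied from the statistics table (source = English, target = Polish).
# RAPID is omitted: its target word count is not fully legible in the table.
COUNTS = {
    "CORDIS": (3_540_445, 3_254_917),
    "JRC-Acquis": (32_447_298, 28_571_342),
}


def target_to_source_ratio(source_words, target_words):
    """Return the number of target words per source word."""
    return target_words / source_words


for name, (source_words, target_words) in COUNTS.items():
    ratio = target_to_source_ratio(source_words, target_words)
    print(f"{name}: {ratio:.2f} Polish words per English word")
```

In both corpora the Polish side has fewer words than the English side, roughly 0.92 and 0.88 Polish words per English word respectively.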