About the multilingual parallel corpus of translations

The multilingual parallel corpus of translations is based on European Commission data.

The multilingual corpus contains EU acts in 22 official EU languages. However, not all texts in the corpus have been translated into every language, so the number of hits varies from language to language. Most of the texts are in English, which was the source language in most cases.

The users of this corpus should be aware that only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.

The multilingual corpus is especially useful for translators.

Currently the corpus contains about 98 million words in 22 languages (not all data have been included yet); the distribution by language can be seen in the statistics.

Searching the corpus

Enter the search word(s) in the input field. Select one source language and one or more target languages (you can select several target languages by holding down the Control (Ctrl) key while clicking the mouse). You can limit the search by specifying a full or partial Celex number.
The output can be limited to terminology data or corpus data, or it can contain both. Corpus output can be monolingual (KWIC - KeyWord In Context) or multilingual.

When making a search, the following wildcards can be used:

_ . ? (underscore, dot or question mark) can substitute for any single character; for instance, to get all the hits containing either organisation or organization with a single query, enter organi_ation (or organi.ation or organi?ation) in the input field.

% * (percent sign or asterisk) can substitute for any number of characters; this can be useful when searching for multi-word terms in the terminology database. For example, if you are looking for terms containing the word "company", the output depends on the search query as follows:

search query	output
`company`	company
`company*`	terms that start with "company" (besides company also company accounts, company area, etc.)
`*company`	terms that end with "company" (besides company also accompany, acquired company, acquiring company, etc.; to exclude terms such as accompany, put a space after the asterisk)
`*company*`	terms that contain "company" (besides the above-mentioned terms also abusive company behaviour, etc.)
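
The wildcard rules above can be illustrated with a short Python sketch. It is not the search engine's actual implementation: the function names and the sample term list are hypothetical, and it assumes that a terminology query must match the whole term and that matching is case-sensitive. The last query places an asterisk on both sides of the word to express "contains".

```python
import re

SINGLE_CHAR = "_.?"  # each stands for exactly one character
MULTI_CHAR = "%*"    # each stands for any number of characters (including none)


def wildcard_to_regex(query):
    """Translate a search query with wildcards into a compiled regex (illustrative only)."""
    parts = []
    for ch in query:
        if ch in SINGLE_CHAR:
            parts.append(".")
        elif ch in MULTI_CHAR:
            parts.append(".*")
        else:
            parts.append(re.escape(ch))  # every other character is matched literally
    return re.compile("".join(parts))


def match_terms(query, terms):
    """Return the terms matched in full by the query (assumption: whole-term matching)."""
    pattern = wildcard_to_regex(query)
    return [t for t in terms if pattern.fullmatch(t)]


# Hypothetical sample of a terminology list, built from the examples in the text.
terms = [
    "company",
    "company accounts",
    "accompany",
    "acquired company",
    "abusive company behaviour",
    "organisation",
    "organization",
]

for query in ["company", "company*", "*company", "* company", "*company*", "organi_ation"]:
    print(query, "->", match_terms(query, terms))
```

Under these assumptions, the query with a space after the asterisk returns only acquired company, because accompany has no space before "company".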

When searching the corpus, entering company as the search term shows all multi-word terms that contain it. The asterisk is useful for finding variable multi-word terms: for example, to find hits with additional words between illicit and drug, write the search query as illicit%drug (or illicit*drug). Besides the usual word combinations with illicit drug, this also returns terms such as illicit manufacture of synthetic drugs, illicit manufacture of narcotic drugs, etc.
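
The corpus behaviour described here can be sketched in Python as follows. This is only an approximation under stated assumptions: the query is assumed to match anywhere inside a text segment, the function names are hypothetical, and the sample segments are invented for illustration (only the phrases with illicit and drug come from the description above).

```python
import re


def corpus_regex(query):
    """Build a regex for a corpus query; % and * span any characters, _ . ? span one."""
    parts = []
    for ch in query:
        if ch in "%*":
            parts.append(".*?")
        elif ch in "_.?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def find_hits(query, segments):
    """Return the segments in which the query matches anywhere (assumed corpus behaviour)."""
    pattern = corpus_regex(query)
    return [s for s in segments if pattern.search(s)]


# Invented sample segments, not taken from the actual corpus.
segments = [
    "measures against illicit drug trafficking",
    "the illicit manufacture of synthetic drugs",
    "the illicit manufacture of narcotic drugs",
    "rules on company accounts",
]

print(find_hits("illicit drug", segments))   # only the first segment
print(find_hits("illicit%drug", segments))   # the first three segments
print(find_hits("company", segments))        # the last segment, no wildcard needed
```

Without the wildcard, only the adjacent word pair is found; with it, segments containing words between illicit and drug are returned as well.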

The corpus was last updated in July 2008.

Please send any comments regarding the corpus to the author of the program.