Building a better glossary with the help of Science

A quick introduction to terminology extraction, based on free tools

We just spent the first few hours of work on our new project and we have the basic structure and the necessary information. Now it’s time to break ground by building a glossary.

With over fifty thousand words of in-game text, nailing down all the core terminology might require a lot of time. Sure, the best way to do it is to read the whole text and run an in-depth analysis, backed by philological research. But even if we had the time and budget for that, why not start with something fast and simple and then grow from there?

The usual approach would be to define a few key terms and then proceed with the translation, adding frequent terms as we encounter them. However, this approach has several issues.

For one, since we work as a team and split the text among us, the perceived frequency of a term can be rather off: translator A may run into a term often and mark it as such, while translator B, having seen only a couple of instances, has rendered it generically as an ordinary word. This means going through the translation all over again to harmonize terms. A more cautious approach, on the other hand, leads to huge glossaries of seldom-used terms, where translators spend valuable time debating the exact nuance of a translation only to find out that the term never shows up again in the following 50k words.

This is where science comes to our aid. The Open University of Catalonia has developed a tool for lexical and terminological analysis which can be very useful: Lexterm, a "Lexical Extractor for Terminology and Translation" (direct download link).

It’s free, open-source software with a bare-bones interface and almost raw input and output, aimed at scholarly research but more than useful for our ends. What Lexterm does is take a corpus file, i.e. a text file containing all the text to be translated, extract the most frequent sequences of n words, or n-grams, possibly including unigrams (single words), and sort them by frequency.

This allows us to have an objective, concrete view of the most frequent terms and phrases, from which we can build a preliminary but significant glossary.
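To make the idea concrete, here is a minimal Python sketch of n-gram counting. It is not Lexterm's code: the function names, the tokenizing rule and the choice to count line by line (so that phrases never span two separate game strings) are all our own assumptions, and the three sample lines are invented from terms mentioned in this post.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def count_ngrams(text, max_n=3):
    """Count every sequence of 1..max_n consecutive words, one line at a time."""
    counts = Counter()
    for line in text.splitlines():
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Invented sample lines, for illustration only.
sample = (
    "Find the copper ore in the Inner City.\n"
    "Bring the copper ore to the Arsenal district.\n"
    "The Inner City is closed.\n"
)
counts = count_ngrams(sample)
top = counts.most_common(5)  # the most frequent n-grams, highest count first
```

On a real corpus the top of such a list is dominated by words like "the", which is exactly why the stop-word filtering described further on matters.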

Actually, Lexterm does much more than that, but that goes well beyond the scope of game localization. We are essentially using a Swiss army knife only as a screwdriver, but it’s a very handy screwdriver. Let’s try it with the in-game text for Venetica, shall we? Assuming we have copied everything into a text file, we load the monolingual corpus…

Loading the corpus for the analysis

Now to configure the n-grams. We want to capture significant words and phrases up to the maximum length the software allows (10-grams). We set the minimum frequency at 3, since a lower number of hits is usually not very significant, and we use the English stop-word list provided with the software (so that the list is not entirely populated with expressions like "do not", "is the", and "will be").
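The same three settings can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the stop-word set is a tiny stand-in for the real list, the rule of discarding n-grams that start or end with a stop word is one common convention (Lexterm's exact rule may differ), and the frequencies are invented.

```python
from collections import Counter

# Tiny illustrative stop-word set; the list shipped with Lexterm is much longer.
STOP_WORDS = {"a", "an", "the", "of", "is", "do", "not", "will", "be", "to", "in"}

def filter_candidates(counts, stop_words=STOP_WORDS, min_freq=3, max_n=10):
    """Keep n-grams that are frequent enough and do not start or end with a stop word."""
    kept = {}
    for ngram, freq in counts.items():
        if freq < min_freq or len(ngram) > max_n:
            continue
        if ngram[0] in stop_words or ngram[-1] in stop_words:
            continue
        kept[ngram] = freq
    # Most frequent first; ties are broken alphabetically for a stable list.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))

# Invented frequencies, for illustration only.
counts = Counter({
    ("do", "not"): 40,
    ("is", "the"): 35,
    ("net", "of", "the", "mask"): 6,
    ("copper", "ore"): 4,
    ("dervish", "assassin"): 2,
})
candidates = filter_candidates(counts)
```

Note that a stop word in the middle of a phrase, as in "Net of the Mask", does not disqualify it under this rule; only the first and last words are checked.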

Filtering out stop words

The list we generated can now be exported and arranged on a spreadsheet. Of course, it will need some weeding out. For instance, notice how "A consumable item", "A consumable" and "consumable item" appear with the same frequency: the software does not differentiate between a longer expression and its shorter components.
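Part of this weeding can be done mechanically before the list reaches the spreadsheet. The sketch below is our own helper, not a Lexterm feature, and the frequencies are invented: it drops any n-gram that sits inside a longer n-gram with exactly the same frequency, on the assumption that the shorter form then never occurs on its own.

```python
def is_contained(short, long):
    """True if the word tuple `short` appears as a contiguous run inside the longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def drop_nested(candidates):
    """Remove n-grams that sit inside a longer n-gram with the same frequency."""
    return {
        ngram: freq
        for ngram, freq in candidates.items()
        if not any(
            freq == other_freq and is_contained(ngram, other)
            for other, other_freq in candidates.items()
        )
    }

# Invented frequencies, for illustration only.
candidates = {
    ("a", "consumable", "item"): 12,
    ("a", "consumable"): 12,
    ("consumable", "item"): 12,
    ("item",): 30,
}
shortlist = drop_nested(candidates)
```

Here only "a consumable item" and the more frequent "item" survive. A human still has to decide that the glossary entry should really be "consumable item" without the article, so this reduces the weeding rather than replacing it.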

Building the term list

Other items on the list require other types of tweaking. A phrase like "Net of the Mask", for instance, obviously needs context: is it a gladiator’s net, a fancy name for a cloak, or something else entirely?

Just because we have a statistical analyzer doesn’t mean we can’t do things the old-fashioned way: looking it up in the source text. There it is! "The Net of the Mask is a union of literate, and mostly rich, Venetians." So "Net" is a fancy name for a secret society, which we decided to translate as "Loggia", after the Lodges of Freemasons.
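Any text editor's search function is enough for this kind of lookup, but if the corpus is already loaded in a script, a small keyword-in-context helper does the same job. The function name and the 40-character window below are our own choices; the first sample line is the sentence quoted above.

```python
import re

def find_in_context(corpus, phrase, width=40):
    """Return each occurrence of `phrase` with up to `width` characters on either side."""
    hits = []
    for match in re.finditer(re.escape(phrase), corpus, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(corpus), match.end() + width)
        hits.append(corpus[start:end].replace("\n", " ").strip())
    return hits

corpus = (
    "The Net of the Mask is a union of literate, and mostly rich, Venetians.\n"
    "Find Sophistos.\n"
)
hits = find_in_context(corpus, "Net of the Mask")
for hit in hits:
    print(hit)
```

The search is case-insensitive, which matters when the candidate list has been lowercased during counting.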

"Find Sophistos"… How about proper names? Well, our trusty style guide says the game will only be subtitled and not dubbed, which means we shouldn’t adapt the names of characters.

A quick check with the developer confirms as much. Had this been any other game, the localization of names would have been open to debate within the team, so that we could propose to the developers (who have the last word on such a vital subject) an agreed opinion based on solid grounds.

Let’s see, what else is there? "Inner City", "dervish assassin": straightforward. "Copper ore": is it minerals or raw copper? And where can we add some "Venetian feel" to the translation? "Arsenal district": we can translate "district" with the specific Venetian term "sestiere" instead of the standard "quartiere". Also, "City Council" can be rendered as "Maggior consiglio", which is the historical council of the Most Serene Republic of Venice. And this means we should translate "Eternal Council" as "Eterno consiglio", since the two terms echo each other…

After a brief run-through of terms and maybe a one-hour brainstorm over Skype, the statistical glossary is completed and can serve as the terminology backbone for the project.

It’s not necessarily complete. Something might have been overlooked or miscounted because of the stop-words, or because it appears in slightly different forms (in Venetica, "back streets" and "darkstreets" referred to the same location). Further additions to the corpus (more handoffs of source text) can change frequency values. A word or expression weeded out as common or straightforward might need a consistent translation after all, or be found in different contexts needing different translations, and so on.

Brainstorming translations

However, our statistical glossary can now serve as a second-tier base on which to build our translation: it has a finer granularity than the style guide, but is equally final and "set in stone" (barring major mistranslations or catastrophically wrong assumptions) and will allow us to proceed, at last, with translating the text.

Update (September 2014) You can see the content of this and previous posts inside our "Joe Freelancer VS the Mammoth Game Translation" presentation for Localization World 2011.

We don’t really use these manual techniques today, as they are fully automated within memoQ, but I think this can still be useful for understanding what is really going on behind the scenes (or if you need to get a project off the ground immediately and without any budget).

Alain Dellepiane

Alain Dellepiane @gloc247 22 May 2013
Alain is the founder of team GLOC.
