Saturday, October 8, 2011

Concatenation is evil.

While I was a student, I participated in a group project to localize a web application. We had to find our own client and everything. We found a nice client with a cool application, and they made things very straightforward for us. They gave us a list of strings that was easily prepared for translation.

There was one problem, though. String concatenation. There was a lot of it. What is string concatenation? It's when you have two strings and you stick them together. One common example: "You have " + (some number) + " item(s) in your shopping cart." The more there is of this, the more difficult it is to localize.

String concatenation occurs almost exclusively in software and websites. In fact, I can't think of any exceptions. It's really easy for programmers (or developers, or whatever they want to call themselves) to fall into the trap of string concatenation. In the example listed above, the number would be calculated using some code somewhere in the application. The easiest way to put the number into the sentence is concatenation. So why does concatenation pose such a problem to translation and localization?

First, let me briefly mention internal strings vs. external strings. Internal strings are strings that are written directly into the code. This might be in the JavaScript of a website or in the source code of a program. Depending on the complexity of the application, internal strings can be almost impossible to localize. It can be done, but it takes longer and costs more. Why? Because you have to sort through all of the code, decide what is relevant for translation, and hope that you don't break the code along the way.

Most people who want their applications localized know this. If not, they will soon. So they externalize their strings. What does this mean? It means you get some sort of list of (identifier) + (string).

This is why concatenation is a problem. You may find something like this:
string.beginning = "You have "
string.end = " item(s) in your shopping cart."
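
In the application code, those two entries would then be glued around the number, roughly like this. This is only a sketch in Python; the `strings` dictionary and the `cart_message` function are made-up names for illustration, not any client's actual code.

```python
# Hypothetical externalized strings, keyed by identifier.
strings = {
    "string.beginning": "You have ",
    "string.end": " item(s) in your shopping cart.",
}


def cart_message(number_of_items):
    # The word order is fixed by the code: beginning + number + end.
    # A translator can change the two fragments, but not where the number goes.
    return strings["string.beginning"] + str(number_of_items) + strings["string.end"]


print(cart_message(3))  # You have 3 item(s) in your shopping cart.
```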

If you speak another language, you may already see the problem. If not, here's a basic explanation. Different languages have different grammar rules. Sometimes, this affects word order. Some languages' grammar rules might force the sentence's structure to be "Your shopping cart contains (number) item(s)." or something like that. Other languages might be different again. You could, in theory, translate "You have " as "Your shopping cart contains ", but that has its own problems. For one thing, string concatenation often comes paired with string reuse. "You have " might be used somewhere else in the program, where that translation would be wrong. It would also pollute your translation memory (more on that in a different post in the future). The list might also look like this:
string.1 = "You have "
string.2 = "something completely unrelated to the contents of your shopping cart"
...
string.453 = "item(s) in your shopping cart."

In this case, context is completely lost. Sometimes the strings might not even appear in the right order in the list. So how do we avoid problems with string concatenation? The answer is simple: We somehow convince our clients to avoid string concatenation. Avoid it like you would avoid tall trees during a thunderstorm.

Unfortunately, getting rid of string concatenation will also get rid of string reuse. String reuse is a nightmare for localization, too, so we're happy to see it go, but it is popular among some programmers, because they don't have to type as much, and programming is all about making things more efficient. If a client raises this objection, you might want to point out that they may need to write completely different code for each language if they still want to use concatenation. That might get their attention.

There are right ways and less right ways to get rid of concatenation. Some strings, like the one mentioned above, have parts that might change. Here is a really wrong way to do it:
string.1 = "You have 1 item in your shopping cart."
string.2 = "You have 2 items in your shopping cart."
...
string.500 = "You have 500 items in your shopping cart."

I think it's obvious what's wrong with this. More strings = more time and money spent by everybody. And it might not always be possible to list every possibility in advance.

Here's a less wrong way to do it:
string.itemsInList = "You have %d(numberOfItems) item(s) in your shopping cart."

This approach sidesteps string concatenation. You put some sort of placeholder in the middle to keep the string together, and you insert the value later. The syntax for this changes with each programming language, but almost all of them support it. There are problems with this. It might cause awkward grammar. It might cause confusion. But it's better than concatenation, and it's better than a separate string for every alternative, especially when that could be an infinitely long list.
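
Here is a rough Python sketch of the placeholder approach. The catalog layout, the `{numberOfItems}` placeholder style (Python's `str.format`), and the German sentence are assumptions made for illustration; the exact syntax depends on the language and framework in use.

```python
# Hypothetical catalogs: one complete sentence per language, with a named placeholder.
catalogs = {
    "en": {"string.itemsInList": "You have {numberOfItems} item(s) in your shopping cart."},
    "de": {"string.itemsInList": "Ihr Warenkorb enthält {numberOfItems} Artikel."},
}


def items_in_list(language, number_of_items):
    template = catalogs[language]["string.itemsInList"]
    # The value is inserted after translation, wherever the translator put the placeholder.
    return template.format(numberOfItems=number_of_items)


print(items_in_list("en", 3))  # You have 3 item(s) in your shopping cart.
print(items_in_list("de", 3))  # Ihr Warenkorb enthält 3 Artikel.
```

Because the translator sees the whole sentence, the placeholder can move to wherever the target language needs it, and the code stays the same for every language.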

Here's what I recommend, where possible:
string.itemsInList = "Number of items in your list: %d(numberOfItems)"

Put the variable part(s) of the string at the end of the string, with a logically complete phrase before it. Or, and this would be a correct usage of string concatenation:
string.itemsInList = "Number of items in your list: "
Then, in the code, the programmer could concatenate that logically complete string with the number.
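
That last variant might look something like this in code. Again, this is a Python sketch with made-up names; the point is only that the translatable string is complete on its own and the number is attached afterward.

```python
# Hypothetical externalized string: a logically complete label.
strings = {"string.itemsInList": "Number of items in your list: "}


def items_label(number_of_items):
    # The label is a complete phrase, so no sentence has to be
    # split around the number.
    return strings["string.itemsInList"] + str(number_of_items)


print(items_label(3))  # Number of items in your list: 3
```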

As you can see, concatenation can cause more than a minor headache for localization teams. It should be managed or avoided, when possible.

Feel free to share your comments below. This is a new blog, so feel free to include things that you would like addressed. I am by no means an expert on everything in localization, but I'm willing to share what knowledge I have and observations that I have made.

7 comments:

  1. Great post, Dale. It was great for someone like me who understands the basics, but not the nitty gritty details of localization engineering.

  3. To simplify your software translation workflow, try using the translation and localization management platform https://poeditor.com

    This online localization tool has a very nice and flexible interface that allows l10n team members to easily collaborate on translating the software strings.

    Also, features like API and Translation Memory automate the localization process a lot, while reducing costs.
