Yorai's Page - Internationalization

Internationalization

Saturday, June 05, 2004 01:36 AM

Roy tries to post in Hebrew. He says, and I quote: "???? ?????. ?????? ?????? ????? ???."

No, there's nothing wrong with your browser, and there's nothing wrong with my post. Hebrew support is just very difficult to implement. In fact, support for any language other than English can be difficult to implement, but Hebrew, Arabic, and several other languages make things even harder.

If you're developing any kind of application aimed at international users ("international user" meaning "anyone who doesn't live in the US, uses a foreign language, or even types words like 'résumé'"), be it a Windows program or a web page, you need to know some things about internationalization. Some of these things are cultural (for example, images and icons may have different meanings - sometimes even offensive meanings - in other cultures), and I won't discuss those here. You may need to consult with experts about such things. The technical aspects, however, are within your control, so that's what we'll cover.

First, some terminology. The two basic concepts in international applications are "internationalization" and "localization". Internationalization (affectionately known as "i18n") means preparing your application to accept multiple locales (a locale is a collective name for the user's environment, including the language, the date, time, and number formats, and other culture-specific settings). This includes things like displaying numbers in a form the user can read, accepting input in a variety of formats, and allowing users to enter data in their own language, when appropriate.
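To make "locale" a little more concrete, here is a minimal Python sketch of the same number and date rendered under two sets of conventions. The `CONVENTIONS` table and both helper functions are hypothetical names invented for illustration; a real application would take these settings from the operating system or an internationalization library rather than a hand-made table.

```python
from datetime import date

# Hypothetical, hand-written conventions for two locales (illustration only).
CONVENTIONS = {
    "en_US": {"group": ",", "decimal": ".", "date": "{m:02d}/{d:02d}/{y}"},
    "de_DE": {"group": ".", "decimal": ",", "date": "{d:02d}.{m:02d}.{y}"},
}


def format_number(value: float, locale_name: str) -> str:
    """Format a number with the locale's grouping and decimal separators."""
    conv = CONVENTIONS[locale_name]
    # Format with "," and "." first, then swap in the locale's separators.
    whole, frac = f"{value:,.2f}".split(".")
    return whole.replace(",", conv["group"]) + conv["decimal"] + frac


def format_date(day: date, locale_name: str) -> str:
    """Format a date in the locale's usual day/month order."""
    pattern = CONVENTIONS[locale_name]["date"]
    return pattern.format(d=day.day, m=day.month, y=day.year)


for name in CONVENTIONS:
    print(name, format_number(1234567.891, name), format_date(date(2004, 6, 5), name))
```

Note how the date of this post reads as 06/05/2004 under one convention and 05.06.2004 under the other, which is exactly the kind of ambiguity a locale-aware application has to resolve.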

Localization means actually translating your application to a specific locale. This includes things like translating every text displayed by your application (including menus, toolbars, error messages and so on), making sure visual elements such as images and icons match the user's culture, and re-organizing the visual layout of your application to match the display language.

In this post, I'll focus on internationalization, and - since this was inspired by an attempt to post to a blog - on text. I'll also discuss some of the specific issues related to Hebrew and other similar scripts.

International Text

The main issue with supporting international text is that text has meaning and context, and proper handling of text requires consideration of these attributes. When I say "meaning" and "context", I'm not referring to the thoughts and ideas that might be expressed by the text, but to the way a digitized text stream should be interpreted. For example, English text contains both lower and upper case characters. When displaying text, programs should take this attribute into account. Sorting and searching, however, often require treating lower and upper case characters as identical, or at least very similar.
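As a small illustration of the case issue, the following Python sketch compares a plain sort with a case-insensitive one, using the standard `str.casefold()` method. The word list and the `same_ignoring_case` helper are made up for the example, and this is only a rough approximation: proper language-aware sorting needs real collation rules, which this sketch does not attempt.

```python
words = ["résumé", "Zebra", "apple", "Résumé", "Apple"]

# A plain sort compares code points, so every upper case word comes first.
plain = sorted(words)

# casefold() is a stronger lower() intended for caseless comparison.
caseless = sorted(words, key=str.casefold)


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings while ignoring case differences."""
    return a.casefold() == b.casefold()


print(plain)
print(caseless)
print(same_ignoring_case("Straße", "STRASSE"))
```

The last line is the interesting one: the German "ß" has no single upper case counterpart in common use, so a naive `lower()` comparison would treat the two spellings as different.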

I won't go into the history of digitized text storage, or into the details of character sets, code pages, or even Unicode. You can find plenty of information elsewhere. For a quick primer about these things, take a look at this article by Joel Spolsky. Joel makes the correct point that for a computer to handle text properly, it needs to know things about the text. In modern systems, such as Windows or the Web, that information is called "encoding". With regard to text, encoding means the collection of information a computer needs to know about how text is stored, such as what binary code is used for each character, how many bits of storage are required for a character, and how text strings should be sorted.
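Here is a short Python sketch of why the encoding has to travel with the text. It stores the same four Hebrew letters in two encodings, then shows two ways things go wrong: reading the bytes with the wrong label, and forcing the text into a character set that has no Hebrew at all. The choice of `cp1255` and `latin-1` is just for illustration.

```python
text = "שלום"  # "shalom": four Hebrew letters

utf8_bytes = text.encode("utf-8")     # two bytes per Hebrew letter
cp1255_bytes = text.encode("cp1255")  # Windows Hebrew code page: one byte each
print(len(text), len(utf8_bytes), len(cp1255_bytes))

# Right bytes, wrong label: the bytes decode "successfully" into nonsense.
garbled = cp1255_bytes.decode("latin-1")

# The damage is reversible as long as nobody has thrown the bytes away.
recovered = garbled.encode("latin-1").decode("cp1255")

# A character set without Hebrew: every letter is replaced and lost for good.
lost = text.encode("latin-1", errors="replace")

print(repr(garbled), recovered == text, lost)
```

The last case produces nothing but question marks, the same symptom as the quoted post at the top of this page, although the sketch makes no claim about which step produced it there.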

Platforms and Applications

If you want your code to support international text, the bare minimum you have to do is make it aware of text encoding. The level of support for encoding depends on your application. Simple applications could probably get away with just supporting the user's locale, relying on the operating system to handle text using the default settings. Platforms, on the other hand, must take encoding into consideration or they, and any application that runs on them, will be useless to international users.
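For a simple application, "being aware of encoding" can be as small as naming the encoding whenever text crosses a boundary, instead of relying on whatever the system default happens to be. A minimal Python sketch, in which the file name `post.txt` and the sample text are arbitrary:

```python
import locale
import os
import tempfile

text = "résumé שלום"

# What the system settings would choose if the code said nothing.
print("default encoding:", locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "post.txt")

    # Name the encoding explicitly on both the write and the read.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with open(path, "r", encoding="utf-8") as handle:
        round_tripped = handle.read()

print(round_tripped == text)
```

With the encoding stated on both sides, the round trip does not depend on the machine's locale. Leaving it out may work on the developer's machine and fail on a user's.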

Fortunately, most platforms today already consider text encoding. Let's consider Roy's blog, and see why his test post failed. Roy's blog runs on .Text, an increasingly popular blog engine written in C# and running on ASP.NET. ASP.NET is part of the .NET Framework, runs over IIS, and uses Microsoft SQL Server as its database back end. The framework, IIS, and SQL Server run on Microsoft Windows. .Text provides a web interface (that is, a page displayed by a web browser) to enter new blog posts.

All the Microsoft server products mentioned in the previous paragraph are platforms, and Microsoft made sure they all fully support international text. They all store text internally using a Unicode encoding, and all of them can import and export text in a variety of other encodings. The problem is that for internationalization to work, every component in the chain must support it. If only one fails, everything fails. In this case, it was .Text.
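The "one weak link" effect is easy to simulate. In the Python sketch below, each stage of a made-up pipeline stores the text in its own encoding and hands it on. The `pass_through` function and the encoding lists are hypothetical and say nothing about how IIS, ASP.NET, or SQL Server are actually configured.

```python
def pass_through(text: str, stage_encodings: list[str]) -> str:
    """Send text through stages that each store it in their own encoding."""
    for encoding in stage_encodings:
        # A stage that cannot represent a character substitutes "?" for it.
        text = text.encode(encoding, errors="replace").decode(encoding)
    return text


post = "שלום עולם"  # "hello world" in Hebrew

all_unicode = ["utf-8", "utf-16", "utf-8"]
one_weak_link = ["utf-8", "cp1252", "utf-16"]

print(pass_through(post, all_unicode) == post)
print(pass_through(post, one_weak_link))
```

Once a single stage has replaced the letters, no later stage can bring them back, however good its Unicode support is.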

If you look at the HTML typically generated by .Text, you may notice something conspicuously missing. The <HEAD> section doesn't contain any encoding information. Web page encoding is stated by including a tag such as the following one in the <HEAD> section of a page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

This tag (taken from the page you're now reading) tells the browser the page contains HTML and uses the Unicode UTF-8 encoding. Since .Text apparently doesn't care about encoding, it omits the tag, and the browser tries to guess the best encoding for the page. In order to post a message in Hebrew, the following chain of events must take place:

1. The user types in Hebrew text in the browser.
2. The browser posts the text to IIS, passing along the encoding information it thinks is correct.
3. IIS asks Windows to convert the text to Unicode, based on the requested encoding. The success of this may depend on Hebrew being supported by the specific installation of Windows.
4. IIS passes the request to ASP.NET.
5. ASP.NET passes the request to .Text.
6. .Text passes the data to SQL Server, specifying whatever encoding it thinks is appropriate.
7. SQL Server stores the data as Unicode.

.Text is more than an application: it's a platform for running blogs. As such, it should probably support internationalization. By ignoring text encoding, it renders itself useless for international users. Fortunately, since every other link in the chain supports internationalization, this should be fairly easy to fix.
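A quick way to check whether a page declares its encoding is to look for the meta tag. The following Python sketch uses only the standard library; the `declared_charset` function and the `_CharsetFinder` class are names invented for this example. It ignores the HTTP Content-Type header, which can also carry a charset, so treat it as a partial check.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


class _CharsetFinder(HTMLParser):
    """Remember the first charset declared by a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            content = attributes.get("content")
            if content:
                # Reuse the standard header parser to pull out the charset.
                header = Message()
                header["Content-Type"] = content
                self.charset = header.get_content_charset()


def declared_charset(html: str) -> Optional[str]:
    """Return the charset a page declares in its markup, or None."""
    finder = _CharsetFinder()
    finder.feed(html)
    return finder.charset


with_tag = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head></html>'
without_tag = "<html><head><title>Blog</title></head></html>"
print(declared_charset(with_tag), declared_charset(without_tag))
```

A page that returns `None` here, and sends no charset in its HTTP headers either, leaves the browser guessing.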

Hebrew and Complex Scripts

Selecting the proper encoding for text, marking it as using that encoding, and handling it based on that information using standard system services are all the actions you need to take to properly support most languages. Certain languages, however, require a little more work. Microsoft refers to such languages as "complex scripts", and provides a special API for handling them. The API, called "Uniscribe", is supported in Windows 2000 and later. It is also included with Internet Explorer 5.0 and later. Hebrew, Arabic, and Thai are considered complex scripts.

You don't have to use the Uniscribe API to support complex scripts, but you do have to take certain things into consideration.

Alignment

Right-to-left scripts are normally aligned to the right, and users reading or writing them expect applications to support right-alignment of text.

Reading order

Even more important than alignment is the reading order. Right-to-left languages, such as Hebrew and Arabic, are actually bidirectional scripts. While text is read from right to left, numbers, for example, are read from left to right. Certain characters (mostly punctuation marks) are considered language-neutral, and are placed according to the current reading order, while others (such as English letters) are language-specific and control the reading order. For example, the string "abc." would appear as ".abc" in RTL reading order. Other characters may change when displayed in RTL reading order. For example, parentheses are mirrored. Bidirectional support requires handling both the logical order of a string (as entered using the keyboard) and the visual layout, which may be very different. It should also handle things like caret movement and hit testing.
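The character properties described here are part of the Unicode character database, and Python exposes them through the standard `unicodedata` module. The sketch below only looks the properties up; it does not implement the reordering itself. The `describe` helper is a made-up name.

```python
import unicodedata


def describe(ch: str) -> tuple[str, bool]:
    """Return a character's bidirectional class and whether it is mirrored."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))


samples = {
    "Latin letter a": "a",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter beh": "\u0628",
    "digit 7": "7",
    "full stop": ".",
    "left parenthesis": "(",
}

for label, ch in samples.items():
    bidi_class, mirrored = describe(ch)
    print(f"{label:20} class={bidi_class:3} mirrored={mirrored}")
```

Letters get a strong class ("L" for left-to-right, "R" or "AL" for right-to-left), digits and punctuation get weaker or neutral classes that take their direction from the surrounding text, and the parenthesis is flagged as mirrored.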

For web pages, alignment and reading order can be controlled by the DIR attribute, which you can apply to almost any tag. Applying the attribute to the <HTML> or <BODY> tags controls the direction of the page (for example, <HTML DIR="RTL"> causes the entire page to be right-aligned, and even moves the vertical scroll bar to the left side of an Internet Explorer window). Applying the attribute to a <P> tag controls the direction of a single paragraph.
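If a page mixes posts in different languages, the direction can be chosen per paragraph. One common heuristic is to use the direction of the first strongly directional character. Here is a minimal Python sketch of that idea; the `base_direction` and `paragraph` functions are hypothetical helpers, and defaulting to "ltr" when no strong character is found is an assumption.

```python
import html
import unicodedata


def base_direction(text: str) -> str:
    """Guess a paragraph's direction from its first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "ltr"  # assumed default when the text has no strong characters


def paragraph(text: str) -> str:
    """Wrap text in a <p> tag carrying the guessed dir attribute."""
    return f'<p dir="{base_direction(text)}">{html.escape(text)}</p>'


print(paragraph("abc."))
print(paragraph("\u05e9\u05dc\u05d5\u05dd abc"))  # Hebrew "shalom", then "abc"
```

The heuristic is wrong for a Hebrew paragraph that happens to open with an English word, so an author-supplied direction should win whenever one is available.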

Special Considerations

Certain RTL languages, like Arabic, require contextual shaping, which means displaying a different glyph for a character depending on its position and the characters around it.

Arabic script relies heavily on ligatures, in which a series of characters is joined into a single glyph. Arabic and Hebrew also use diacritics, which are placed above, below, or inside a character glyph.
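Unicode happens to contain legacy "presentation form" characters that make shaping and ligatures visible without a rendering engine. The Python sketch below inspects them with the standard `unicodedata` module. It only illustrates the concepts; real shaping is done by the text renderer when it picks glyphs, not by swapping characters in the string.

```python
import unicodedata

BEH = "\u0628"  # the Arabic letter beh as it is stored in text

# Four presentation forms of the same letter, one per position in a word.
for code in range(0xFE8F, 0xFE93):
    form = chr(code)
    print(unicodedata.name(form), "->", unicodedata.decomposition(form))

# A ligature: lam followed by alef, drawn as one joined glyph.
LAM_ALEF = "\ufefb"
parts = unicodedata.normalize("NFKC", LAM_ALEF)
print([unicodedata.name(ch) for ch in parts])

# A diacritic: Hebrew shin followed by the combining vowel point qamats.
pointed = "\u05e9\u05b8"
print(len(pointed), unicodedata.category(pointed[1]))
```

The pointed Hebrew letter is two code points that display as one unit, which is why caret movement and deletion need care in scripts that use diacritics.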

Thai, although written left to right, is also a complex script. It is written without spaces between words, so it has complex rules for word breaks and text justification. A text entry control must be aware of these rules. Thai support must also prevent certain invalid character combinations.
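Thai is written without spaces between words, so finding word breaks usually takes a dictionary. The Python sketch below shows the simplest version of the idea, a greedy longest-match segmenter. The `segment` function and the two-word dictionary are toy assumptions; production systems use large dictionaries and smarter algorithms.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Split text into words by greedy longest match against a dictionary."""
    words = []
    longest = max(map(len, dictionary), default=1)
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shorter ones.
        for end in range(min(len(text), start + longest), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1  # unknown character: emit it on its own
        words.append(text[start:end])
        start = end
    return words


# Toy dictionary: "language" and "Thai", written with escapes for clarity.
LANGUAGE = "\u0e20\u0e32\u0e29\u0e32"  # ภาษา
THAI = "\u0e44\u0e17\u0e22"            # ไทย
phrase = LANGUAGE + THAI               # "Thai language", written with no space

print(phrase.split())                  # whitespace splitting finds one "word"
print(segment(phrase, {LANGUAGE, THAI}))
```

Splitting on whitespace sees the whole phrase as a single token, while the dictionary lookup recovers the two words. That is why line wrapping and double-click selection in Thai cannot rely on spaces.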

Conclusion

If you're developing an application and expect it to reach the hands of international users, you must take certain things into account. Ignoring text encoding may even render your application useless for such users. In today's world, where global distribution has become the default, you may not be able to afford that.

וזה לא כל כך קשה לכתוב בעברית, אם יש לך כלים מתאימים.

(Translation: "And it's not that hard to write in Hebrew, if you have the right tools.")