Intlize

What is intlize?

Intlize is a tool to add internationalization to your application.

Such tools already exist; why another one?

Most notably, the GNU gettext suite and the catgets suite are used for internationalization. Both have advantages and disadvantages of their own. Intlize intends to combine gettext's and catgets' advantages. It can optionally produce its own straightforward format, optimized for both speed and size.

Gettext

The gettext team assumes that a programmer wants to write code and not care much about internationalization. As a result, the impact on the source code is minimized. All the programmer has to do is mark which strings are translatable. Marking a string is easy, e.g. replace puts("Write something") with puts(_("Write something")). Please refer to the gettext manual for details.
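
The marking described above might look like this in a small C file. This is only a sketch: the text domain "demo" and the directory "/usr/share/locale" are placeholder values, and the helper names init_i18n and greeting are made up for illustration. The gettext calls themselves come from libintl.

```c
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

/* Conventional shorthand: _("...") marks a string and looks it up. */
#define _(String) gettext(String)

/* "demo" and "/usr/share/locale" are placeholders for this sketch. */
static void init_i18n(void)
{
    setlocale(LC_ALL, "");                       /* use the user's locale */
    bindtextdomain("demo", "/usr/share/locale"); /* where catalogs live */
    textdomain("demo");                          /* default text domain */
}

/* Hypothetical helper returning one marked string. */
static const char *greeting(void)
{
    /* If no translation is found, gettext returns its argument unchanged. */
    return _("Write something");
}
```

A program would call init_i18n() once at startup and then puts(greeting()). On systems where libintl is not part of libc, linking with -lintl may be needed.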

The string encapsulated in _() in the above example serves three purposes:

it is the string that the translator translates
it is used as the default string returned if no translation for this string is found
it provides a handle for finding the translated string

This implies that this string appears in the executable and in every translation file, which is not space-efficient. The messages are searched by string comparison, which is not the fastest way. On the other hand, out-of-date translations keep working without modification; some messages simply remain untranslated.

Gettext's libintl is licensed under the LGPL, which may limit its use in non-free programs, or add a dependency in that case.

Conclusion

Gettext is very convenient for both programmer and translator. Speed and size could be improved. The licence may limit its usability in some cases.

Catgets

Marking a string with catgets looks something like catgets(catd, 12, 34, "Write something"). This is already clumsy compared to gettext, but things are even worse. The indices (12 and 34 in the above example) have to be unique throughout the package being internationalized. It is the programmer's task to ensure that.
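
For comparison, here is a minimal sketch of the catgets call in context. The enum names SET_MAIN and MSG_WRITE and the helper write_prompt are invented for this example; catopen, catgets and catclose are the standard functions from nl_types.h.

```c
#include <nl_types.h>
#include <stdio.h>

/* The two indices from the text: a set number and a message number.
   The names are hypothetical; keeping each pair unique is up to the programmer. */
enum { SET_MAIN = 12, MSG_WRITE = 34 };

/* Hypothetical helper: look up one message in an already opened catalog. */
static const char *write_prompt(nl_catd catd)
{
    /* The last argument is the default returned when the lookup fails. */
    return catgets(catd, SET_MAIN, MSG_WRITE, "Write something");
}
```

A caller would open the catalog with catopen("demo", NL_CAT_LOCALE), where "demo" is a placeholder name, pass the descriptor to write_prompt(), and call catclose() on it if catopen() did not return (nl_catd) -1.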

However, it is perfectly possible to use the catgets interface, and many applications do. Catgets made its way into libc, so it adds no dependencies to C/C++ programs. Looking up two indices may be faster than looking up a string, but this is hard to tell. The translatable string does not serve as a handle to the translated string, so it does not appear in the translation files, reducing their size.

The binary format of the translations that the lookup is performed on is not defined. It may vary from compiler to compiler, even from version to version of the same compiler. Fortunately the generator of those binary files is delivered with the compiler.

Conclusion

Even though the catgets interface is less convenient than gettext, it is part of the standard libraries of the compiler. The translation files tend to be smaller than the ones used by gettext. The format of the binary translations is not defined.

Intlize

When it comes to marking strings, intlize stays as close as possible to gettext. The syntax is actually _("Write something", 0). The numerical value is needed because intlize uses an index for translation table lookup. A string marked this way is recognized by the gettext tool xgettext. Human readable translation files in intlize are therefore the same as in gettext. This may ease the work for the translator.

The index value is not always 0, but has to be unique for every different string. It is intlize's task to ensure that. When adding a marked string, the programmer always writes 0, which is an invalid index. Intlize alters this index as needed.

Intlize and catgets

This is an obvious combination. Strings are marked the intlize way, which is convenient for the programmer. These strings are extracted with xgettext, producing the well known "po-files". A large variety of tools for writing translations in this file format is available. Intlize then produces a catgets catalog from the po-file. This catalog is still in human readable form, since catgets' binary format is not defined. The binary format is produced at compile time with the catgets tool gencat.

Intlize native

Intlize can optionally produce output in its own binary format. In this mode, the translatable string does not appear in the application or any binary translation file. Intlize produces an additional translation file named "C", which contains the translatable strings themselves rather than a translation. This file is used whenever no suitable translation file is found.

Intlize's binary translation files have minimal overhead. They are basically an array of C-strings. The indices given to the marked strings in the sources are contiguous and can map directly into that array. It is hard to imagine a faster translation string lookup.
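
To illustrate why such a lookup is cheap, here is a sketch of indexing into an array of C-strings. It is not intlize's actual runtime code: the struct msg_table and the function msg_lookup are hypothetical names, and returning NULL for an out-of-range index is an assumption made for this example.

```c
#include <stddef.h>

/* Hypothetical in-memory view of a loaded translation file:
   one C string per index, where index 0 is the invalid slot. */
struct msg_table {
    const char *const *msgs;  /* msgs[i] is message # i */
    size_t count;             /* number of entries in msgs */
};

/* Hypothetical lookup: a bounds check plus one array access.
   Returns NULL when the index is outside the table (an assumption). */
static const char *msg_lookup(const struct msg_table *t, size_t index)
{
    if (t == NULL || index >= t->count)
        return NULL;
    return t->msgs[index];
}
```

No string comparison or hashing is involved; the cost does not depend on the number of messages.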

Intlize binary format

offset 0	8 byte	string	"intlize\0"	file magic
offset 8	1 byte	char		character encoding
offset 9	variable	string		version, zero terminated
	2 byte	string	"?\0"	message # 0
	variable	string		message # 1
	variable	string		message # 2
	variable	string		...
	variable	string		message # n

Character encoding

The character encoding must not be 0.

1	ISO-8859-1
2	ISO-8859-2
...
16	ISO-8859-16
17	KOI8-R
18	KOI8-T
19	KOI8-U
32	UTF-8

Version

An arbitrary zero terminated string to detect whether the versions of the package and the translation file match.
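
The layout above could be read along the following lines. This is a sketch under stated assumptions, not intlize's own loader: the struct intlize_file, the function intlize_parse and the cap of 64 messages are made up for illustration, and the code assumes that the messages follow the version string directly, without padding.

```c
#include <stddef.h>
#include <string.h>

#define MAX_MSGS 64  /* arbitrary cap for this sketch */

/* Hypothetical parsed view; all pointers refer into the caller's buffer. */
struct intlize_file {
    unsigned char encoding;       /* code from the encoding table, never 0 */
    const char *version;          /* zero terminated version string */
    const char *msgs[MAX_MSGS];   /* msgs[i] is message # i */
    size_t count;                 /* number of messages found */
};

/* Returns 0 on success, -1 if the buffer does not match the layout. */
static int intlize_parse(const char *buf, size_t len, struct intlize_file *out)
{
    static const char magic[8] = "intlize";  /* 7 characters plus '\0' */
    size_t pos;

    /* Smallest valid file: magic (8) + encoding (1) + empty version (1). */
    if (len < 10 || memcmp(buf, magic, sizeof magic) != 0)
        return -1;
    /* A final '\0' guarantees that every strlen below stays in bounds. */
    if (buf[len - 1] != '\0')
        return -1;

    out->encoding = (unsigned char)buf[8];
    if (out->encoding == 0)       /* the encoding must not be 0 */
        return -1;

    out->version = buf + 9;
    pos = 9 + strlen(out->version) + 1;

    /* The remaining bytes are the messages, one C string after another. */
    out->count = 0;
    while (pos < len) {
        if (out->count == MAX_MSGS)
            return -1;
        out->msgs[out->count++] = buf + pos;
        pos += strlen(buf + pos) + 1;
    }
    return 0;
}
```

A caller would read the whole translation file into memory, pass it to intlize_parse(), and compare the version field with the version of the package before using the messages.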