Internationalization is most commonly done with the GNU gettext suite or the catgets suite. Each has advantages and disadvantages of its own. Intlize aims to combine the advantages of both. Optionally, it can produce its own straightforward format, optimized for both speed and size.
The gettext team assumes that a programmer wants to write code and not care much about internationalization. As a result, the impact on the source code is minimized. All the programmer has to do is mark which strings are translatable. Marking a string is easy, e.g. replace puts("Write something") with puts(_("Write something")).
Please refer to the gettext manual for details.
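The marking step can be sketched in C as follows. This is only an illustration: with GNU gettext one would normally include <libintl.h> and define _(s) as gettext(s). So that the sketch compiles without libintl, it uses a hypothetical stand-in, toy_gettext, and a made-up one-entry catalog.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for gettext(): a tiny in-memory catalog.
   The names toy_entry, toy_catalog and toy_gettext are assumptions
   made for this sketch; they are not part of gettext. */
struct toy_entry {
    const char *msgid;   /* the string as written in the source */
    const char *msgstr;  /* its translation */
};

static const struct toy_entry toy_catalog[] = {
    { "Write something", "Schreib etwas" },
};

const char *toy_gettext(const char *msgid)
{
    size_t n = sizeof toy_catalog / sizeof toy_catalog[0];
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(toy_catalog[i].msgid, msgid) == 0)
            return toy_catalog[i].msgstr;
    return msgid;  /* no translation found: use the original string */
}

/* The marker itself. With real gettext: #define _(s) gettext(s) */
#define _(s) toy_gettext(s)

void greet(void)
{
    puts(_("Write something"));
}
```

The only change to the calling code is the _() wrapper around the literal; the string doubles as the lookup key and as the text shown when no translation exists.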
The string enclosed in _() in the above example serves three purposes:
Gettext's libintl is licensed under the LGPL, which may limit its use in non-free programs, or add a dependency in that case.
Marking a string with catgets looks something like catgets(catd, 12, 34, "Write something"). This is already clumsy compared to gettext, but things get worse. The indices (12 and 34 in the above example) have to be unique throughout the package being internationalized. It is the programmer's task to ensure that.
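A minimal sketch of the catgets style, using the POSIX calls catopen, catgets and catclose from <nl_types.h>. The catalog name "myapp" and the macro names are assumptions made for this example; the set and message numbers are the ones from the text.

```c
#include <nl_types.h>
#include <stdio.h>

/* Set and message numbers are maintained by hand and must not clash
   anywhere in the package. The macro names are hypothetical. */
#define SET_MAIN            12
#define MSG_WRITE_SOMETHING 34

/* Return the translated prompt, or the built-in default text when no
   catalog could be opened. */
const char *prompt_text(nl_catd catd)
{
    const char *fallback = "Write something";

    if (catd == (nl_catd)-1)   /* catopen() failed earlier */
        return fallback;
    return catgets(catd, SET_MAIN, MSG_WRITE_SOMETHING, fallback);
}

void print_prompt(void)
{
    /* "myapp" is a hypothetical catalog name. */
    nl_catd catd = catopen("myapp", NL_CAT_LOCALE);

    puts(prompt_text(catd));
    if (catd != (nl_catd)-1)
        catclose(catd);
}
```

Every call site carries a catalog descriptor, two numbers and the default text, which is the extra bookkeeping the paragraph above describes.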
However, it is perfectly possible to use the catgets interface, and many applications do. Catgets made its way into libc, so it adds no further dependencies to C/C++ programs. Looking up two indices may be faster than looking up a string, but this is hard to tell. The translatable string does not serve as a handle to the translated string, so it does not appear in the translation files, which reduces their size.
The binary format of the translations that the lookup is performed on is not defined. It may vary from compiler to compiler, even from version to version of the same compiler. Fortunately the generator of those binary files is delivered with the compiler.
Even though the catgets interface is less convenient than gettext, it is part of the standard libraries of the compiler. The translation files tend to be smaller than the ones used by gettext. The format of the binary translations is not defined.
When it comes to marking strings, intlize stays as close as possible to gettext. The syntax is actually _("Write something", 0). The numerical value is needed because intlize uses an index for translation table lookup. A string marked this way is recognized by the gettext tool xgettext. Human-readable translation files in intlize are therefore the same as in gettext. This may ease the work for the translator.
The index value is not always 0, but has to be unique for every different string. It is intlize's task to ensure that. When adding a marked string, the programmer always writes 0, which is an invalid index. Intlize alters this index as needed.
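How such an index could drive the lookup can be sketched as below. This is an assumption-laden illustration, not intlize's actual macro: the table contents, the function lookup_message and the definition of _ are all hypothetical. Entry 0 stands for the invalid index a programmer writes before intlize has assigned a real one.

```c
#include <stddef.h>

/* Hypothetical in-memory translation table. Entry 0 is a placeholder
   for the invalid index 0; the other entries are made-up translations. */
static const char *const messages[] = {
    "?",                     /* index 0: invalid, not yet assigned */
    "Schreib etwas",         /* index 1 */
    "Datei nicht gefunden",  /* index 2 */
};

#define MESSAGE_COUNT (sizeof messages / sizeof messages[0])

/* Plain array access; out-of-range indices fall back to entry 0. */
const char *lookup_message(size_t index)
{
    return index < MESSAGE_COUNT ? messages[index] : messages[0];
}

/* Hypothetical definition of the marker: the text stays in the source
   for xgettext and the programmer, only the index is used at run time. */
#define _(text, index) lookup_message(index)
```

After intlize has rewritten the source, a call such as _("Write something", 1) would resolve with a single array access in this sketch.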
This is an obvious combination. Strings are marked the intlize way, which is convenient for the programmer. These strings are extracted with xgettext, producing the well-known "po-files". A large variety of tools for writing translations in this file format is widely available. Intlize then produces a catgets catalog from the po-file. This catalog is still in human-readable form, since the catgets binary format is not defined. The binary format is produced at compile time with the catgets tool gencat.
Intlize can optionally produce output in its own binary format. In this mode, the translatable string does not appear in the application or in any binary translation file. Intlize produces an additional translation file named "C" that contains not a translation but the translatable strings themselves. This file is used whenever no suitable translation file is found.
Intlize's binary translation file format has minimal overhead. A file is basically an array of C strings. The indices given to the marked strings in the sources are contiguous and map directly into that array. It is hard to imagine a faster translation string lookup.
offset | size | type | content | remark
0 | 8 byte | string | "intlize\0" | file magic
8 | 1 byte | char | character encoding |
9 | variable | string | version, zero terminated |
(varies) | 2 byte | string | "?\0" | message # 0
(varies) | variable | string | message # 1 |
(varies) | variable | string | message # 2 |
(varies) | variable | string | ... |
(varies) | variable | string | message # n |
The character encoding must not be 0.
1 | ISO-8859-1 |
2 | ISO-8859-2 |
... | |
16 | ISO-8859-16 |
17 | KOI8-R |
18 | KOI8-T |
19 | KOI8-U |
32 | UTF-8 |
The version is an arbitrary zero-terminated string used to detect whether the versions of the package and the translation file match.
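The layout described above can be read with a few lines of C. The following is a sketch under stated assumptions, not intlize's own loader: the names intlize_catalog and parse_catalog are hypothetical, the whole file is assumed to be in memory already, and the messages are assumed to follow each other back to back until the end of the file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of a parsed translation file. All
   pointers refer into the caller's buffer; nothing is copied. */
struct intlize_catalog {
    unsigned char encoding;   /* 1..16 ISO-8859-n, 32 UTF-8, never 0 */
    const char *version;      /* zero-terminated version string */
    const char **messages;    /* messages[0] is the "?" placeholder */
    size_t count;             /* number of messages, including # 0 */
};

/* Parse buf[0..len) into *out, storing message pointers in the
   caller-supplied table. Returns 0 on success, -1 on malformed input
   or when the table is too small. */
int parse_catalog(const char *buf, size_t len,
                  const char **table, size_t capacity,
                  struct intlize_catalog *out)
{
    static const char magic[8] = "intlize";  /* 7 letters + '\0' */
    const char *end;
    size_t pos, count = 0;

    /* Magic (8 bytes), encoding (1 byte), at least an empty version. */
    if (len < 10 || memcmp(buf, magic, sizeof magic) != 0)
        return -1;
    if (buf[8] == 0)          /* the encoding must not be 0 */
        return -1;

    pos = 9;                  /* version string starts at offset 9 */
    end = memchr(buf + pos, '\0', len - pos);
    if (end == NULL)
        return -1;
    out->encoding = (unsigned char)buf[8];
    out->version = buf + pos;
    pos = (size_t)(end - buf) + 1;

    /* The rest of the file is a sequence of zero-terminated strings. */
    while (pos < len) {
        end = memchr(buf + pos, '\0', len - pos);
        if (end == NULL || count == capacity)
            return -1;
        table[count++] = buf + pos;
        pos = (size_t)(end - buf) + 1;
    }

    /* Message # 0 must be the "?" placeholder for the invalid index. */
    if (count == 0 || strcmp(table[0], "?") != 0)
        return -1;

    out->messages = table;
    out->count = count;
    return 0;
}
```

Once parsed, a lookup is just out->messages[index]. A caller could compare out->version against the version compiled into the package and fall back to the "C" file when they differ.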