Giulia Locale Project
This is the wiki page for the new Giulia locale project.
We have a discussion mailing list, which you can subscribe to at http://mail.gnome.org/mailman/listinfo/locale-list. The archives are at http://mail.gnome.org/archives/locale-list/.
For the original announcement, which is something like a manifesto, see http://mail.gnome.org/archives/locale-list/2005-August/msg00000.html.
Basic Reading
We will mostly be depending on data from the CLDR project; to understand it, we recommend reading Unicode Technical Standard #35, Locale Data Markup Language (LDML).
But we can't throw away the data from glibc either, since many applications in GNU/Linux already rely on it. One of the major problems to tackle is how to resolve differences between the data sources. For some examples of differences in locale data, see CLDR's comparison charts. At the same time, we will need to let the end user customize things, since the data in CLDR is not free from bugs either. Even if it were bug-free, the user may have different preferences for any reason, including personal ones.
Targeted applications
We are targeting as wide a range of applications as we can, but from what we have already seen, some free software projects may not like the restrictions that follow from our not wanting to rewrite everything from scratch.
Currently, our targeted applications include:
- The GNOME desktop (clock applet, evolution, nautilus, epiphany, ...)
- Web servers (ability to use a per-session locale, in Apache + PHP)
Then we would try for:
- Mozilla (at least on GNU/Linux)
- KDE
We can then aim for world domination...
License
We have not finished the discussion about the license yet, but it will be either the LGPL or something with fewer restrictions.
LGPL seems to be the most likely, since there is a lot of code to be borrowed from GNU gettext and the GNU C library. There is also a lot of code available in ICU, whose license is fortunately LGPL compatible.
We don't know much about the Perl license and the restrictions it may impose on us, but there is also some code useful to us in the Perl interpreter itself.
Language and technical details
While we love high-level languages, it seems that we should stay with good old C, at least in some of the core parts. In the words of Bruno Haible: "the program that creates the binary locale data files from the XML files must be written in portable C, with as little dependencies as possible. Not in Perl, not in Java, no C99, no <stdbool.h>, just plain old C and libxml2." (Actually, <stdbool.h> is possible through the corresponding gnulib module.)
In order to be attractive to more applications, including the web applications that CLDR specifically targets, we should very probably avoid depending on GLib, which would otherwise have been very desirable.
We will probably use higher level languages in a high level graphical application used by end users to customize their settings. That will probably be done in Python.
We would need an mmap-able file format, because parsing a locale description is very computation intensive: for glibc locales it takes between 1 and 10 seconds per locale, depending on the speed of your machine. For the CLDR locales it is likely to be more, given that one needs an XML parser, and a complex one like libxml2 at that.
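To illustrate why an mmap-able format helps, here is a minimal sketch of what the reading side could look like. The header layout and the "GLCK" magic are invented for the example; they are not a proposed on-disk format.
{{{
/* Minimal sketch of the reading side: map a hypothetical cooked locale
   file read-only and use it without any parsing at run time.  The
   header layout and the "GLCK" magic are invented for illustration;
   this is not a proposed on-disk format. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct cooked_header {
  char     magic[4];      /* "GLCK" */
  uint32_t n_entries;     /* number of string entries that follow */
};

int main(int argc, char **argv)
{
  if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  struct stat st;
  if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

  /* One mmap call; every locale datum is then reachable by offset. */
  void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  const struct cooked_header *hdr = (const struct cooked_header *) base;
  /* An offset table follows the header; each offset points to a
     NUL-terminated string somewhere later in the file. */
  const uint32_t *offsets =
    (const uint32_t *) ((const char *) base + sizeof *hdr);

  if (memcmp(hdr->magic, "GLCK", 4) == 0 && hdr->n_entries > 0)
    printf("first entry: %s\n", (const char *) base + offsets[0]);

  munmap(base, st.st_size);
  close(fd);
  return 0;
}
}}}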
Terminology
CLDR: The acronym for the Common Locale Data Repository project, one of the important bases for this locale project.
cooker, cooking, etc: Whichever way we do this, we will need some component that converts a set of XML files to a binary format that can be accessed easily in memory. That component is called the cooker, and the process is called cooking (see the sketch below).
the spec and LDML: We usually refer to the Unicode Technical Standard #35, Locale Data Markup Language (LDML), as the spec or LDML.
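As a rough illustration of the cooking step, the following sketch reads an LDML file with libxml2 and dumps element names and text content into a flat, length-prefixed binary stream. The output format is purely illustrative, not the one we will actually use, and the traversal is deliberately naive.
{{{
/* Skeleton of a "cooker": read an LDML file with libxml2 and dump the
   element names and text content into a flat binary stream.  The
   length-prefixed output format is purely illustrative. */
#include <libxml/parser.h>
#include <libxml/tree.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void emit_string(FILE *out, const char *s)
{
  uint32_t len = (uint32_t) strlen(s);
  fwrite(&len, sizeof len, 1, out);   /* length prefix */
  fwrite(s, 1, len, out);             /* raw bytes */
}

static void cook_node(FILE *out, xmlNode *node)
{
  for (xmlNode *cur = node; cur != NULL; cur = cur->next) {
    if (cur->type == XML_ELEMENT_NODE) {
      xmlChar *content = xmlNodeGetContent(cur);
      emit_string(out, (const char *) cur->name);
      emit_string(out, content != NULL ? (const char *) content : "");
      if (content != NULL)
        xmlFree(content);
      cook_node(out, cur->children);
    }
  }
}

int main(int argc, char **argv)
{
  if (argc != 3) {
    fprintf(stderr, "usage: %s LDML-FILE OUTPUT\n", argv[0]);
    return 1;
  }
  xmlDoc *doc = xmlReadFile(argv[1], NULL, 0);
  if (doc == NULL) { fprintf(stderr, "cannot parse %s\n", argv[1]); return 1; }

  FILE *out = fopen(argv[2], "wb");
  if (out == NULL) { perror("fopen"); xmlFreeDoc(doc); return 1; }

  cook_node(out, xmlDocGetRootElement(doc));

  fclose(out);
  xmlFreeDoc(doc);
  xmlCleanupParser();
  return 0;
}
}}}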
Required components
This is a list of components needed to implement CLDR properly. There are also several small algorithms specified in the text of the specification (like one for fallback localization of timezone names), which should be easy to implement as soon as we decode the spec, which may require looking at ICU code.
Unicode Sets
A Unicode Set is something like a regular expression, used in the LDML spec for defining sets of "characters". But characters here don't necessarily mean Unicode characters; they may be things which are considered characters (i.e., elements of a writing system) in some locales. For example, the Latin Serbian exemplar set is this:
[a-p r-v z đ ć č ž š {lj} {nj} {dž}]
Where "lj", "nj", and "dž" are actually multi-character Unicode strings.
Unicode sets are used in exemplar sets (list of characters used in each locale), formatting currencies, and collation.
To implement this properly, you need to support a kind of regular expression, which will require having almost all of the information from the Unicode Character Database at hand when parsing CLDR files. For example, a Unicode set may be defined as [:age=3.2:], which is all the characters introduced in Unicode 3.2 or earlier, or [:jg=Beh:], which is all Arabic letters in the same joining group as Beh.
Also, we would need an index from Unicode character names to their codes, since Unicode Sets also allow syntax like \N{EURO SIGN} to refer to U+20AC. This already exists as part of GNU gettext (libuniname).
We should probably start small, supporting only things like [a-p r-v z đ ć č ž š {lj} {nj} {dž}], and then move on to fancier things.
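A toy sketch of that small start might look like the following. It only tokenizes single characters, ASCII ranges, and braced strings, treats UTF-8 sequences as opaque byte strings, and ignores the [:property:] and \N{...} syntaxes entirely.
{{{
/* Toy tokenizer for the simple subset of Unicode set syntax mentioned
   above: single characters, ASCII ranges like a-p, and braced strings
   like {lj}.  UTF-8 sequences are carried through as opaque bytes. */
#include <stdio.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence starting at *s (1 if malformed). */
static size_t utf8_len(const unsigned char *s)
{
  if (*s < 0x80) return 1;
  if ((*s & 0xE0) == 0xC0) return 2;
  if ((*s & 0xF0) == 0xE0) return 3;
  if ((*s & 0xF8) == 0xF0) return 4;
  return 1;
}

static void parse_set(const char *set)
{
  const unsigned char *p = (const unsigned char *) set;

  if (*p == '[') p++;
  while (*p != '\0' && *p != ']') {
    if (*p == ' ') { p++; continue; }

    if (*p == '{') {                       /* multi-character element */
      const unsigned char *end =
        (const unsigned char *) strchr((const char *) p, '}');
      if (end == NULL) break;
      printf("string:  %.*s\n", (int) (end - p - 1), p + 1);
      p = end + 1;
    } else {
      size_t n = utf8_len(p);
      if (n == 1 && p[1] == '-' && p[2] != '\0' && p[2] != ']') {
        printf("range:   %c-%c\n", p[0], p[2]);   /* ASCII range */
        p += 3;
      } else {
        printf("char:    %.*s\n", (int) n, p);    /* single "character" */
        p += n;
      }
    }
  }
}

int main(void)
{
  /* The Latin Serbian exemplar set from above, written as UTF-8. */
  parse_set("[a-p r-v z \xC4\x91 \xC4\x87 \xC4\x8D \xC5\xBE \xC5\xA1"
            " {lj} {nj} {d\xC5\xBE}]");
  return 0;
}
}}}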
Number Format Patterns
[http://www.unicode.org/reports/tr35/#Number_Format_Patterns Number format patterns] are used in formatting numbers and currencies, and seem to come from Java. These include features like rounding to the nearest 0.65, involving half-even rounding.
They are not very clean to implement, but we could probably borrow a lot of code from ICU for these. A parser for a good portion of these is already in GNU gettext, file format-java.c. But the library code that interprets these format strings needs to be written.
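As a taste of the library side, here is a minimal sketch of rounding to an arbitrary increment (such as 0.65) with half-even tie breaking. Real pattern handling of course has to parse the pattern string first, and should work with exact decimal values rather than doubles.
{{{
/* Minimal sketch of "round to increment" with half-even tie breaking,
   e.g. rounding prices to the nearest 0.65.  Real code would work on
   exact decimal values, not doubles. */
#include <math.h>
#include <stdio.h>

static double round_to_increment(double value, double increment)
{
  double quotient = value / increment;
  double floor_q  = floor(quotient);
  double frac     = quotient - floor_q;
  double rounded;

  if (frac > 0.5)
    rounded = floor_q + 1.0;
  else if (frac < 0.5)
    rounded = floor_q;
  else   /* exactly halfway: pick the even neighbour */
    rounded = (fmod(floor_q, 2.0) == 0.0) ? floor_q : floor_q + 1.0;

  return rounded * increment;
}

int main(void)
{
  printf("%.2f\n", round_to_increment(1.24, 0.65));   /* 1.30 */
  printf("%.2f\n", round_to_increment(0.97, 0.65));   /* 0.65 */
  return 0;
}
}}}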
Unicode Collation Algorithm
The Unicode Collation Algorithm (UCA), its default data file (DUCET), related algorithms, and the ICU-style customizations (tailorings) are all needed to support anything about collation. Some of the customization requirements are hard to understand and are really references to ICU behavior, so we will probably need to go through the ICU code a lot for this.
For collation to work according to the UCA, we also need to support normalization (perhaps only to FCD, a looser condition than NFC or NFD that is still enough for canonically equivalent strings to be treated equally by the UCA).
It is also noteworthy that having only the latest version of the DUCET would not suffice, since collation rules may require a specific version of the DUCET through a <version> tag. To do this properly, that is, according to the requested version of Unicode and the UCA, we may also need to keep older copies of the Unicode data, but I'm not really sure about that.
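To give a feel for the mechanics, here is a toy sketch of the multi-level comparison at the heart of the UCA. The weights are invented (ASCII letters only, with case decided at the tertiary level); real code must use the DUCET plus the locale's tailorings, and must handle expansions, contractions, and normalization.
{{{
/* Toy illustration of the UCA idea: map each character to a
   (primary, secondary, tertiary) collation element, build a sort key
   level by level, and compare the keys.  The weights are invented;
   there is no bounds checking -- toy code only. */
#include <stdint.h>
#include <stdio.h>

struct element { uint16_t primary, secondary, tertiary; };

static struct element lookup(char c)
{
  struct element e;
  char lower = (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
  e.primary   = (uint16_t) (0x1000 + (unsigned char) lower); /* letter identity */
  e.secondary = 0x0020;                                      /* no accent data here */
  e.tertiary  = (c >= 'A' && c <= 'Z') ? 0x0008 : 0x0002;    /* case */
  return e;
}

/* Sort key: all primary weights, a 0x0000 separator, all secondary
   weights, another separator, then all tertiary weights. */
static size_t make_key(const char *s, uint16_t *key)
{
  size_t n = 0;
  for (const char *p = s; *p; p++) key[n++] = lookup(*p).primary;
  key[n++] = 0;
  for (const char *p = s; *p; p++) key[n++] = lookup(*p).secondary;
  key[n++] = 0;
  for (const char *p = s; *p; p++) key[n++] = lookup(*p).tertiary;
  return n;
}

static int toy_collate(const char *a, const char *b)
{
  uint16_t ka[256], kb[256];
  size_t la = make_key(a, ka), lb = make_key(b, kb);
  size_t min = la < lb ? la : lb;
  for (size_t i = 0; i < min; i++)
    if (ka[i] != kb[i]) return ka[i] < kb[i] ? -1 : 1;
  return (la > lb) - (la < lb);
}

int main(void)
{
  /* Case only matters when the primary level ties:
     "ab" < "Ab" (tertiary decides), but "Ab" < "b" (primary decides). */
  printf("%d\n", toy_collate("ab", "Ab"));  /* -1 */
  printf("%d\n", toy_collate("Ab", "b"));   /* -1 */
  return 0;
}
}}}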
Alternate calendar handling code
To handle any of the alternate calendars, we need code to support date conversion and related tasks for different calendars. My communications with the authors of the book Calendrical Calculations, which is the reference for CLDR (and the best book available on the matter), trying to convince them to release some of their code/algorithms under a free software license (or to allow work based on them to be released as free software), have been unsuccessful. They mentioned an unfruitful discussion with RMS about this, but their main point was that they get licensing fees for the use of their algorithms in proprietary software, and those fees would decrease if someone released the code as free software.
Some of the authors' older code is available in Emacs under the GPL. I don't know whether we could get permission from the FSF to release derivative work based on it under the LGPL, but we may try anyway.
The quality of the alternate calendar code available otherwise as free software is usually poor.
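Independently of that code, the basic building block is simple enough: conversions between calendars are usually routed through a fixed day number. The sketch below uses the well-known Fliegel and Van Flandern formula for the Julian Day Number of a Gregorian date, which is long published and not taken from Calendrical Calculations.
{{{
/* Fliegel & Van Flandern formula for the Julian Day Number of a
   Gregorian date (integer arithmetic, proleptic Gregorian calendar).
   Each alternate calendar then only needs conversions to/from JDN. */
#include <stdio.h>

static long gregorian_to_jdn(int year, int month, int day)
{
  long a = (14 - month) / 12;
  long y = year + 4800 - a;
  long m = month + 12 * a - 3;
  return day + (153 * m + 2) / 5 + 365 * y + y / 4 - y / 100 + y / 400 - 32045;
}

int main(void)
{
  printf("%ld\n", gregorian_to_jdn(2000, 1, 1));   /* 2451545 */
  return 0;
}
}}}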
Locale-dependent casing algorithms
Locale-dependent algorithms for case-insensitive comparison, uppercasing, lowercasing, and titlecasing are needed in many places where something is case-insensitive. For example, exemplar sets are case-insensitive: while [a-z] is what appears in the file, an API should return the equivalent of [A-Za-z] when asked for the data.
The casing algorithms and data are different for Turkish, Azerbaijani (as written in the Latin script), and Lithuanian. I believe there is code taking care of this in GLib.
Titlecasing is needed in some languages like Czech and Russian, where the default casing of some names differs between running text and lists in the GUI.
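A minimal illustration of why casing must be locale-dependent: in Turkish and Azerbaijani, uppercasing the ASCII letter 'i' must produce 'İ' (U+0130), not 'I'. The sketch below handles only that one special case and treats everything else as plain ASCII, which a real implementation obviously cannot do.
{{{
/* Locale-dependent uppercasing, reduced to one special case:
   Turkish/Azerbaijani 'i' uppercases to U+0130 (İ), not to 'I'.
   Output is UTF-8; everything else is treated as plain ASCII. */
#include <ctype.h>
#include <stdio.h>

static void uppercase_utf8(const char *in, char *out, int turkish)
{
  for (; *in; in++) {
    if (turkish && *in == 'i') {
      *out++ = (char) 0xC4;        /* U+0130 LATIN CAPITAL LETTER I   */
      *out++ = (char) 0xB0;        /* WITH DOT ABOVE, encoded in UTF-8 */
    } else {
      *out++ = (char) toupper((unsigned char) *in);
    }
  }
  *out = '\0';
}

int main(void)
{
  char buf[64];
  uppercase_utf8("istanbul", buf, 0);  printf("%s\n", buf);  /* ISTANBUL */
  uppercase_utf8("istanbul", buf, 1);  printf("%s\n", buf);  /* İSTANBUL */
  return 0;
}
}}}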
XPath
This is available in libxml2.
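For example, pulling the exemplar characters out of an LDML file with libxml2's XPath support could look roughly like this; the element path /ldml/characters/exemplarCharacters follows the LDML structure.
{{{
/* Small example of the XPath support already in libxml2: extract the
   exemplarCharacters element from an LDML file. */
#include <libxml/parser.h>
#include <libxml/xpath.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  if (argc != 2) { fprintf(stderr, "usage: %s LDML-FILE\n", argv[0]); return 1; }

  xmlDoc *doc = xmlReadFile(argv[1], NULL, 0);
  if (doc == NULL) return 1;

  xmlXPathContext *ctx = xmlXPathNewContext(doc);
  xmlXPathObject *res = xmlXPathEvalExpression(
      (const xmlChar *) "/ldml/characters/exemplarCharacters", ctx);

  if (res != NULL && res->nodesetval != NULL && res->nodesetval->nodeNr > 0) {
    xmlChar *text = xmlNodeGetContent(res->nodesetval->nodeTab[0]);
    printf("exemplar set: %s\n", text);
    xmlFree(text);
  }

  xmlXPathFreeObject(res);
  xmlXPathFreeContext(ctx);
  xmlFreeDoc(doc);
  xmlCleanupParser();
  return 0;
}
}}}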
POSIX regular expression handling
This is needed for handling "yesexpr" and "noexpr". We could probably copy the code directly from glibc.
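A minimal sketch of that use, matching a user's answer against a yesexpr pattern with the standard regcomp()/regexec() interface; the pattern is hard-coded here, whereas the real value comes from the locale data.
{{{
/* yesexpr/noexpr handling with POSIX extended regular expressions. */
#include <regex.h>
#include <stdio.h>

static int matches(const char *pattern, const char *answer)
{
  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return 0;
  int ok = (regexec(&re, answer, 0, NULL, 0) == 0);
  regfree(&re);
  return ok;
}

int main(void)
{
  const char *yesexpr = "^[yY]";             /* would come from the locale */
  printf("%d\n", matches(yesexpr, "yes"));   /* 1 */
  printf("%d\n", matches(yesexpr, "non"));   /* 0 */
  return 0;
}
}}}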
Other subprojects
CLDR is not all that is needed for a nice localized system. There are several things the CLDR project has not tackled yet, or is not planning to tackle for various reasons. This project may implement some of those missing pieces.
Examples include: