###
ICU
###

Introduction
============

Internationalization (i18n, “i” then 18 letters then “n”) is the process of handling data with respect to a particular locale:

-  The number 5 representing five US dollars might be formatted as

   -  “$5.00” in American English,
   -  “US$5.00” in Canadian English, or
   -  “5,00 $US” in French.

-  A list of people’s names in a phone book would sort

   -  in English alphabetically; but
   -  in German, where “ä”/“ö”/“ü” are often interchangeable with “ae”/“oe”/“ue”, alphabetically, but with umlauted vowels treated as their two-vowel counterparts.

-  The currency whose code is “CHF” might be formatted as

   -  “Swiss Franc” in English, but
   -  “franc suisse” in French.

-  The Unix time 1590803313070 (in milliseconds) might format as the time string

   -  “9:48:33 PM Eastern Daylight Time” in American English, but
   -  “21:48:33 Nordamerikanische Ostküsten-Sommerzeit” in German.

i18n encompasses far more than this, but you get the basic idea.
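
To make the time example concrete, here is a minimal Python sketch of the locale-independent half of that work: turning the Unix time into a wall-clock time, then printing it on a 12-hour and on a 24-hour clock. It is only an illustration, not how ICU does it. The fixed UTC-4 offset stands in for Eastern Daylight Time (an assumption, so that the sketch does not need a time zone database), and the localized zone names are deliberately left out because they come from locale data.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for Eastern Daylight Time so that
# this sketch does not depend on an installed time zone database.
EDT = timezone(timedelta(hours=-4), "EDT")


def wall_clock(unix_ms: int, tz: timezone) -> datetime:
    """Convert a Unix time in milliseconds to an aware datetime in tz."""
    return datetime.fromtimestamp(unix_ms / 1000, tz)


moment = wall_clock(1590803313070, EDT)

# 12-hour clock, as American English prefers (leading zero removed).
american = moment.strftime("%I:%M:%S %p").lstrip("0")
# 24-hour clock, as German prefers.
german = moment.strftime("%H:%M:%S")

print(american)
print(german)
```

Everything beyond the digits (the zone name, the AM/PM marker in other languages, the field order) is what the locale data described in the rest of this document supplies.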

Internationalization in SpiderMonkey and Gecko
==============================================

SpiderMonkey implements extensive i18n capabilities through the `ECMAScript Internationalization API <https://tc39.es/ecma402/>`__ and the global ``Intl`` object. Gecko requires i18n capabilities to implement text shaping, sort operations in some contexts, and various other features.

SpiderMonkey and Gecko use `ICU <http://site.icu-project.org/>`__, Internationalization Components for Unicode, to implement many low-level i18n operations. (Line breaking, implemented instead in ``intl/lwbrk``, is a notable exception.) Gecko and SpiderMonkey also use ICU’s implementations of certain i18n-*adjacent* operations (for example, Unicode normalization).

ICU date/time formatting functionality requires extensive knowledge of time zone names and when zone transitions occur. The IANA ``tzdata`` database supplies this information.

A final note of caution: ICU depends upon an exact Unicode version. Other parts of SpiderMonkey and Gecko have separate dependencies on an exact Unicode version. Updates to ICU and related components *must* be synchronized with updates to those other parts so that the entirety of SpiderMonkey, and the entirety of Gecko including SpiderMonkey within it, advance to new Unicode versions in lockstep. [#lockstep]_

.. [#lockstep]
  The steps involved in updating Gecko-in-general’s Unicode version, and updating SpiderMonkey’s code dependent on Unicode version, are `documented on WikiMO <https://wiki.mozilla.org/I18n:Updating_Unicode_version>`__.

Building SpiderMonkey or Gecko with ICU
=======================================

SpiderMonkey and Gecko can be built using either a periodically-updated copy of ICU in ``intl/icu/source`` (using time zone data in ``intl/tzdata/source``), or using a system-provided ICU library (dependent on its own ``tzdata`` information). Pass ``--with-system-icu`` when configuring to use system ICU. (Using system ICU will disable some ``Intl`` functionality, such as historically accurate time zone calculations, that can’t be readily supported without a precisely-controlled ICU.) ICU version requirements advance fairly quickly as Gecko depends on features and bug fixes in newer ICU releases. You’ll get a build error if you try to use an unsupported ICU.

SpiderMonkey’s ``Intl`` API may be enabled or disabled by configuring with ``--with-intl-api`` (the default) or ``--without-intl-api``. SpiderMonkey built without the ``Intl`` API doesn’t require ICU. However, if you build without the ``Intl`` API, some non-``Intl`` JavaScript functionality will not exist (``String.prototype.normalize``) or won’t fully work (for example, ``String.prototype.toLocale{Lower,Upper}Case`` will not respect a provided locale, and the various ``toLocaleString`` functions have best-effort behavior).

Using ICU functionality in SpiderMonkey and Gecko
=================================================

ICU headers are considered system headers by the Gecko build system, so they must be listed in ``config/system-headers.mozbuild``. Code that wishes to use ICU functionality may use ``#include "unicode/unorm.h"`` or similar to do so.

Gecko and SpiderMonkey code may use ICU’s stable C API (ICU4C). These functions are stable and shouldn’t change as ICU updates occur. (ICU4C’s ``enum`` initializers are not always stable: while initializer values are stable, new initializers are sometimes added, perhaps behind ``#ifdef U_HIDE_DRAFT_API``. As a result, exhaustive ``switch``\ es may need ``#ifdef``\ s around some ``case``\ s.)

Gecko and SpiderMonkey are strongly discouraged from using ICU’s C++ API (unfortunately including all smart pointer classes), because the C++ API doesn’t provide ICU4C’s compatibility guarantees. Rarely, we tolerate C++ API use when no stable option exists. But the API has to “look” reasonably stable, and we usually want to start a discussion with upstream about adding a stable API to eventually use. Use symbols from ``namespace icu`` to access ICU C++ functionality. *Talk to the current imported-ICU owner (presently Jeff Walden) before you start doing any of this!*

SpiderMonkey and Gecko’s imported ICU
=====================================

Build system
------------

The system for building ICU lives in ``config/external/icu`` and ``intl/icu_sources_data.py``. We generate a Mozilla-compatible build system rather than using ICU’s build system. The build system is shared by both SpiderMonkey and Gecko.

ICU includes functionality we never use, so we don’t naively compile all of it. When we update ICU, we extract the list of files to compile from ``intl/icu/source/{common,i18n}/Makefile.in`` and then exclude a manually-maintained list of unused files (stored in ``intl/icu_sources_data.py``).

Locale and time zone data
-------------------------

ICU contains a considerable amount of raw locale data: formatting characteristics for each locale, strings for things like currencies and languages for each locale, localized time zone specifiers, and so on. This data lives in human-readable files in ``intl/icu/source/data``. Time zone data in ``intl/tzdata/source`` is stored in partially-compiled formats (some of them only partly human-readable).

However, a normal Gecko build never uses these files! Instead, both ICU and ``tzdata`` data are precompiled into a large, endian-specific ``icudtNNE.dat`` (``NN`` = ICU version, ``E`` = endianness) file. [#why-icudt-not-rebuilt-every-time]_ That file is added to ``config/external/icu/data/`` and is checked into the Mozilla tree, to be directly incorporated into Gecko/SpiderMonkey builds. For size reasons, only the little-endian version is checked into the tree. It is converted into a big-endian version when necessary during the build.
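
As a small illustration of the ``icudtNNE.dat`` naming scheme, the following Python sketch builds the file name for a given ICU major version and byte order. The helper name ``icu_data_filename`` is hypothetical (it is not part of the build system). The ``l``/``b`` letters follow ICU’s usual convention for little-endian and big-endian data files.

```python
import sys


def icu_data_filename(icu_major_version: int, byteorder: str = sys.byteorder) -> str:
    """Return the icudtNNE.dat name for an ICU version and byte order.

    Hypothetical helper for illustration only. "l" marks little-endian data
    and "b" marks big-endian data.
    """
    endianness_letter = {"little": "l", "big": "b"}[byteorder]
    return f"icudt{icu_major_version}{endianness_letter}.dat"


# The checked-in file is always the little-endian one.
print(icu_data_filename(67, "little"))
# A big-endian build would use a converted file with this name instead.
print(icu_data_filename(67, "big"))
```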

ICU’s locale data covers *all* ICU internationalization features, including ones we never need. We trim locale data to size with an ``intl/icu/data_filter.json`` `data filter <https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md>`__ when compiling ``icudtNNE.dat``. Removing *too much* data won’t *necessarily* break the build, so it’s important that we have automated tests for the locale data we actually use in order to detect mistakes.

.. [#why-icudt-not-rebuilt-every-time]
  ``icudtNNE.dat`` isn’t compiled during a SpiderMonkey/Gecko build because doing so would require ICU command-line tools, and it’s a pain to either compile and run them during the build or to require them as build dependencies.

Local patching of ICU and CLDR
------------------------------

We generally don’t patch our copy of ICU unless there is a compelling need. When we do patch, we usually only apply reasonably small patches that have been reviewed and landed upstream (so that our patch will be obsolete when we next update ICU).

Local patches are stored in the ``intl/icu-patches`` directory. They’re applied when ICU is updated, so merely editing ICU files in place won’t persist changes across an ICU update.

Patching ICU also allows for patching some parts and uses of CLDR, the data backing ICU operations. Note that this does not include character data, which is `updated separately <https://wiki.mozilla.org/I18n:Updating_Unicode_version>`__, and that any such patching does not affect any other CLDR uses. In particular, Fluent localization depends on Rust crates which themselves depend on CLDR data directly and separately from ICU. Any CLDR patches should remain reasonably small; larger changes such as adding support for a new locale should be done upstream.

Updating imported code
----------------------

The process of updating imported i18n-relevant code is *semi*-automated. We use a series of shell and Python scripts to do the job.

Updating ICU
~~~~~~~~~~~~

New ICU versions are announced on the `icu-announce <https://lists.sourceforge.net/lists/listinfo/icu-announce>`__ mailing list. Both release candidates and actual releases are announced here. It’s a good idea to attempt to update ICU when a release candidate is announced, just in case some serious problem is present (especially one that would be painful to fix through local patching).

``intl/update-icu.sh`` updates our ICU to a given ICU release: [#icu-git-argument]_

.. code:: bash

  $ cd "$topsrcdir/intl"
  $ # Ensure certain Python modules in the tree are accessible when updating.
  $ export PYTHONPATH="$topsrcdir/python/mozbuild/"
  $ #               <URL to ICU Git>                       <release tag name>
  $ ./update-icu.sh https://github.com/unicode-org/icu.git release-67-1

.. [#icu-git-argument]
  The ICU Git URL argument lets you update from a local ICU clone. This can speed up work when you’re updating to a new ICU release and need to adjust or add new local patches.

But usually you’ll want to update to the latest commit from the corresponding ICU maintenance branch so that you pick up fixes landed post-release:

.. code:: bash

  $ cd "$topsrcdir/intl"
  $ # Ensure certain Python modules in the tree are accessible when updating.
  $ export PYTHONPATH="$topsrcdir/python/mozbuild/"
  $ #               <URL to ICU Git>                       <maintenance name>
  $ ./update-icu.sh https://github.com/unicode-org/icu.git maint/maint-67

Updating ICU will also update the language tag registry (which records language tag semantics needed to correctly implement ``Intl`` functionality). Therefore it’s likely necessary to update SpiderMonkey’s language tag handling after running this [#update-icu-warning-langtags]_. See below where the ``langtags`` mode of ``make_intl_data.py`` is discussed.

.. [#update-icu-warning-langtags]
  ``update-icu.sh`` will print a notice as a reminder of this:

  .. code:: bash

     INFO: Please run 'js/src/builtin/intl/make_intl_data.py langtags' to update additional language tag files for SpiderMonkey.

``update-icu.sh`` is intended for *replayability*, not for hands-off runnability. It downloads ICU source, prunes various irrelevant files, replaces ``intl/icu/source`` with the new files – and then blindly applies local patches in fixed order.

Often a local patch won’t apply, or new patches must be applied to successfully build. In this case you’ll have to manually edit ``update-icu.sh`` to abort after only *some* patches have been applied, make whatever changes are necessary by hand, generate a new/updated patch file by hand, then carefully reattempt updating. (The people who have updated ICU in the past, usually jwalden and anba, follow this awkward process and don’t have good ideas on how to improve it.)

Any time ICU is updated, you’ll need to fully rebuild whichever of SpiderMonkey or Gecko you’re building. For SpiderMonkey, delete your object directory and reconfigure from scratch. For Gecko, change the message in the top-level `CLOBBER <https://searchfox.org/mozilla-central/source/CLOBBER>`__ file.

Updating tzdata
~~~~~~~~~~~~~~~

ICU contains a copy of ``tzdata``, but that copy is whatever ``tzdata`` release was current at the time the ICU release was finalized. Time zone data changes much more often than that: every time some national legislature or tinpot dictator decides to alter time zones. [#tzdata-release-frequency]_ The `tz-announce <https://mm.icann.org/pipermail/tz-announce/>`__ mailing list announces changes as they occur. (Note that we can’t *immediately* update when a release occurs: ICU’s `icu-data <https://github.com/unicode-org/icu-data>`__ repository must be updated before we can update our ``tzdata``.)

.. [#tzdata-release-frequency]
  To give a sense of how frequently ``tzdata`` is updated, and the irregularity of releases over time:

  -  2019 had three ``tzdata`` releases, 2019a through 2019c.
  -  2018 had nine ``tzdata`` releases, 2018a through 2018i.
  -  2017 had three ``tzdata`` releases, 2017a through 2017c.

Therefore, either (usually) after you update ICU *or* when a new ``tzdata`` release occurs, you’ll need to update our imported ``tzdata`` files, which also updates ``config/external/icu/data/icudtNNE.dat``. If you’ve just run ``update-icu.sh``, it will warn you that you need to do this. [#update-icu-warning-old-tzdata]_ Whenever you update time zone data, you’ll also need to update SpiderMonkey’s time zone handling, described further below.

.. [#update-icu-warning-old-tzdata]
  For example:

  ::

     WARN: Local tzdata (2020a) is newer than ICU tzdata (2019c), please run './update-tzdata.sh 2020a'

First, make sure you have a usable ``icupkg`` on your system. [#icupkg-on-system]_ Then run the ``update-tzdata.sh`` script to update ``intl/tzdata`` and ``icudtNNE.dat``:

.. code:: bash

  $ cd "$topsrcdir/intl"
  $ ./update-tzdata.sh 2020a # or whatever the latest release is

.. [#icupkg-on-system]
  To install ``icupkg`` on your system:

  -  On Fedora, use ``sudo dnf install icu``.
  -  On Ubuntu, use ``sudo apt-get install icu-devtools``.
  -  On macOS, use ``brew install icu4c``.
  -  On Windows, you’ll need to `download a binary build of ICU for Windows <https://github.com/unicode-org/icu/releases/tag/release-67-1>`__ and use the ``bin/icupkg.exe`` or ``bin64/icupkg.exe`` utility inside it.

  If you’re on Windows, or for some reason you don’t want to use the ``icupkg`` now in your ``$PATH``, you can manually specify it on the command line using the ``-e /path/to/icupkg`` flag:

  .. code:: bash

     $ cd "$topsrcdir/intl"
     $ ./update-tzdata.sh -e /path/to/icupkg 2020a # or whatever the latest release is

  *In principle*, the ``icupkg`` you use *should* be the one from the ICU release/maintenance branch being built: if there’s a mismatch, you might encounter an ICU “format version not supported” error. If you’re on Windows, make sure to download a binary build for that release/branch. On other platforms, you might have to build your own ICU from source. The steps required to do this are left as an exercise for the reader. (In the somewhat longer term, the update commands might be changed to do this themselves.)

If ``tzdata`` must be updated on trunk, you’ll almost certainly have to backport the update to Beta and ESR. Don’t attempt to backport the literal patch; instead, rerun the commands documented here on each branch.

Updating SpiderMonkey ``Intl`` data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SpiderMonkey itself can’t blindly invoke ICU to perform every i18n operation, because sometimes ICU behavior deviates from what web specifications require. Therefore, when ICU is updated, we must also update SpiderMonkey itself (including various generated tests). This is done using the various modes of ``js/src/builtin/intl/make_intl_data.py``.

Updating SpiderMonkey time zone handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ECMAScript Internationalization API requires that time zone identifiers (``America/New_York``, ``Antarctica/McMurdo``, etc.) be interpreted according to `IANA <https://www.iana.org/time-zones>`__ semantics. Unfortunately, ICU doesn’t precisely implement those semantics. (See comments in ``js/src/builtin/intl/SharedIntlData.h`` for details.) Therefore SpiderMonkey has to do certain pre- and post-processing based on what’s in IANA but not in ICU, and what’s in ICU that isn’t in IANA.

Use ``make_intl_data.py``\ ’s ``tzdata`` mode to update time zone information:

.. code:: bash

  $ cd "$topsrcdir/js/src/builtin/intl"
  $ # make_intl_data.py requires yaml.
  $ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
  $ python3 ./make_intl_data.py tzdata

The ``tzdata`` mode accepts two optional arguments that generally will not be needed:

-  ``--tz`` will act using data from a local ``tzdata/`` directory containing raw ``tzdata`` source (note that this is *not* the same as what is in ``intl/tzdata/source``). It may be useful to help debug problems that arise during an update.
-  ``--ignore-backzone`` will omit time zone information before 1970. SpiderMonkey and Gecko include this information by default. However, because (by deliberate policy) ``tzdata`` information before 1970 is not reliable to the same degree as data since 1970, and backzone data has a size cost, a SpiderMonkey embedding or custom Gecko build might decide to omit it.

Updating SpiderMonkey language tag handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Language tags (``en``, ``de-CH``, ``ar-u-ca-islamicc``, and so on) are the primary means of specifying localization characteristics. The ECMAScript Internationalization API supports certain operations that depend upon the current state of the language tag registry (stored in the Unicode Common Locale Data Repository, CLDR, a repository of all locale-specific characteristics) that specifies subtag semantics:

-  ``Intl.getCanonicalLocales`` and ``Intl.Locale`` must replace alias subtags with their preferred forms. For example, ``ar-u-ca-islamic-civil`` uses the preferred Islamic calendar subtag, while ``ar-u-ca-islamicc`` uses an alias.
-  ``Intl.Locale.prototype.maximize`` and ``Intl.Locale.prototype.minimize`` accept a language tag and add or remove “likely” subtags from it. For example, ``de`` most likely refers to German using Latin script in Germany, so it maximizes to ``de-Latn-DE`` – and in reverse, ``de-Latn-DE`` minimizes to simply ``de``.
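
The two operations in the list above can be pictured as table lookups. The Python sketch below is a toy model only: its two tables contain just the entries mentioned in the text, whereas the real tables are generated from CLDR. The function names are hypothetical, and the calendar helper assumes that ``-u-ca-`` is the only Unicode extension keyword in the tag.

```python
# Toy excerpts for illustration; the real data is generated from CLDR.
CALENDAR_ALIASES = {"islamicc": "islamic-civil"}
LIKELY_SUBTAGS = {"de": "de-Latn-DE"}


def canonicalize_calendar(tag: str) -> str:
    """Replace an aliased -u-ca- calendar value with its preferred form.

    Simplification: assumes the calendar is the last (and only) keyword.
    """
    prefix, separator, calendar = tag.partition("-u-ca-")
    if not separator:
        return tag  # No calendar keyword, so nothing to replace.
    return prefix + separator + CALENDAR_ALIASES.get(calendar, calendar)


def maximize(tag: str) -> str:
    """Add likely subtags when the toy table knows the tag."""
    return LIKELY_SUBTAGS.get(tag, tag)


def minimize(tag: str) -> str:
    """Remove subtags that maximizing the shorter tag would add back."""
    for short, full in LIKELY_SUBTAGS.items():
        if full == tag:
            return short
    return tag


print(canonicalize_calendar("ar-u-ca-islamicc"))
print(maximize("de"))
print(minimize("de-Latn-DE"))
```

Because both tables are data, not logic, keeping them current is a matter of regenerating them, which is what the ``langtags`` mode is for.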

These decisions vary over time: as countries change [#soviet-union]_, as customs change, as language prevalence in regions varies, etc.

.. [#soviet-union]
  For just one relevant example, the breakup of the Soviet Union is the cause of numerous entries in the language tag registry. ``ru-SU``, Russian as used in the Soviet Union, is now expressed as ``ru-RU``, Russian as used in Russia; ``ab-SU``, Abkhazian as used in the Soviet Union, is now expressed as ``ab-GE``, Abkhazian as used in Georgia; and so on for the other former Soviet republics.

Use ``make_intl_data.py``\ ’s ``langtags`` mode to update language tag information to the same CLDR version used by ICU:

.. code:: bash

  $ cd "$topsrcdir/js/src/builtin/intl"
  $ # make_intl_data.py requires yaml.
  $ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
  $ python3 ./make_intl_data.py langtags

The CLDR version used will be printed in the header of CLDR-sensitive generated files. For example, ``intl/components/src/LocaleGenerated.cpp`` currently begins with:

.. code:: cpp

  // Generated by make_intl_data.py. DO NOT EDIT.
  // Version: CLDR-37
  // URL: https://unicode.org/Public/cldr/37/core.zip

Updating SpiderMonkey currency support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currencies use different numbers of fractional digits in their preferred formatting. Most currencies use two fractional digits; a handful use none or some other number. Currency fractional digit information is maintained by ISO (in ISO 4217) and must be updated as currencies change their preferred fractional digits or new currencies arise that don’t use two fractional digits.
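
As a rough illustration of why this table matters, the Python sketch below rounds an amount to a currency’s preferred number of fractional digits. The table is a tiny hand-written excerpt (JPY uses no fractional digits, BHD uses three) and is not the generated data. The function name and the choice of rounding half up are assumptions made for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Tiny hand-written excerpt; currencies not listed fall back to two digits.
MINOR_UNIT_DIGITS = {"JPY": 0, "BHD": 3}
DEFAULT_DIGITS = 2


def round_to_minor_units(amount: str, currency: str) -> str:
    """Round a decimal amount to the currency's fractional digit count."""
    digits = MINOR_UNIT_DIGITS.get(currency, DEFAULT_DIGITS)
    quantum = Decimal(1).scaleb(-digits)  # 1, 0.01 or 0.001
    return str(Decimal(amount).quantize(quantum, rounding=ROUND_HALF_UP))


print(round_to_minor_units("5", "USD"))
print(round_to_minor_units("1234.5", "JPY"))
print(round_to_minor_units("1.5", "BHD"))
```

A currency missing from the table silently gets two digits, which is why the table must be regenerated when a currency that needs a different count appears.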

Currency updates are fairly uncommon, so it’ll be rare to need to update currency info. A `newsletter <https://www.currency-iso.org/en/home/amendments/newsletter.html>`__ periodically sends updates about changes.

Use ``make_intl_data.py``\ ’s ``currency`` mode to update currency fractional digit information:

.. code:: bash

  $ cd "$topsrcdir/js/src/builtin/intl"
  $ # make_intl_data.py requires yaml.
  $ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
  $ python3 ./make_intl_data.py currency

Updating SpiderMonkey measurement formatting support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``Intl`` API supports formatting numbers as measurement units (for example, “17 meters” or “42 meters per second”). It specifies a list of units that must be supported. We record that list centrally in ``js/src/builtin/intl/SanctionedSimpleUnitIdentifiers.yaml``, verify that ICU supports every unit in it, and generate supporting files from it.
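
The difference between a simple unit such as “meter” and a compound unit such as “meter per second” can be sketched as follows. This Python example assumes the ECMAScript Internationalization API’s rule that a unit identifier is either a sanctioned simple unit or two of them joined by ``-per-``. The set below is a four-entry excerpt written for the example, and ``is_supported_unit`` is a hypothetical name.

```python
# Four-entry excerpt for illustration; the full list lives in the YAML file.
SANCTIONED_SIMPLE_UNITS = {"hour", "kilometer", "meter", "second"}


def is_supported_unit(identifier: str) -> bool:
    """Accept a sanctioned simple unit or a "<unit>-per-<unit>" compound."""
    if identifier in SANCTIONED_SIMPLE_UNITS:
        return True
    numerator, separator, denominator = identifier.partition("-per-")
    return (
        bool(separator)
        and numerator in SANCTIONED_SIMPLE_UNITS
        and denominator in SANCTIONED_SIMPLE_UNITS
    )


print(is_supported_unit("meter"))
print(is_supported_unit("meter-per-second"))
print(is_supported_unit("furlong"))
```

Only the simple units need to be listed; compounds follow from them.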

If ``Intl``\ ’s list of supported units is ever updated, two separate changes will be required.

First, ``intl/icu/data_filter.json`` must be updated to incorporate localized strings for the new unit. These strings are stored in ``icudtNNE.dat``, so you’ll have to re-update ICU (and likely reimport ``tzdata`` as well, if it’s been updated since the last ICU update) to rewrite that file.

Second, use ``make_intl_data.py``\ ’s ``units`` mode to update unit handling and associated tests in SpiderMonkey:

.. code:: bash

  $ cd "$topsrcdir/js/src/builtin/intl"
  $ # make_intl_data.py requires yaml.
  $ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
  $ python3 ./make_intl_data.py units

Updating SpiderMonkey numbering systems support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``Intl`` API also supports formatting numbers in various numbering systems (for example, “123” using Latin digits or “一二三” using Han decimal digits). The list of numbering systems that we must support is stored in ``js/src/builtin/intl/NumberingSystems.yaml``. We verify that ICU supports these numbering systems and generate supporting files from that list.
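
For simple decimal numbering systems, switching systems amounts to a digit-for-digit substitution. The Python sketch below shows that idea for the two systems named above. The ``DIGITS`` table and the function name are written for this example (the keys ``latn`` and ``hanidec`` are the usual identifiers for these systems), and the sketch ignores grouping separators and non-decimal systems.

```python
# Digit strings for two decimal numbering systems, ordered 0 through 9.
DIGITS = {
    "latn": "0123456789",
    "hanidec": "〇一二三四五六七八九",
}


def to_numbering_system(number: int, system: str) -> str:
    """Rewrite a number's decimal digits using another numbering system."""
    table = str.maketrans(DIGITS["latn"], DIGITS[system])
    return str(number).translate(table)


print(to_numbering_system(123, "latn"))
print(to_numbering_system(123, "hanidec"))
```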

When the list of supported numbering systems needs to be updated, run ``make_intl_data.py`` with the ``numbering`` mode to update it and associated tests in SpiderMonkey:

.. code:: bash

  $ cd "$topsrcdir/js/src/builtin/intl"
  $ # make_intl_data.py requires yaml.
  $ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
  $ python3 ./make_intl_data.py numbering