======================================
Managing the built-in en-US dictionary
======================================

The en-US build of Firefox includes a built-in Hunspell dictionary based on the
`SCOWL`_ dataset. This document describes the process to add new words to the
dictionary, or update it to the current upstream version.

For more information about Hunspell or the affix file format, you can check
`the Ubuntu man page for hunspell
<https://manpages.ubuntu.com/manpages/bionic/man5/hunspell.5.html>`_.

Requesting to add new words to the en-US dictionary
===================================================

If you’d like to add new words to the dictionary, you can add your request to
`this bug <https://bugzilla.mozilla.org/show_bug.cgi?id=enus-dictionary>`_:

* Include all possible forms, e.g. plural and genitive forms for nouns,
  different tenses for verbs.
* Try to provide information on the terms you want to add, in particular
  references to external sources that confirm the usage of the term (e.g.
  Merriam-Webster or Oxford online dictionaries).

.. note::

  If you’re fixing the existing bug with pending requests, make sure to `file a
  new bug`_ and move the alias ``enus-dictionary`` (in the *Details* section)
  from the old bug to the new one.

Adding new words to the en-US dictionary
========================================

This section describes the process for adding new words to the dictionary:

#. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick
   Reference`), if you don’t already have one, and make sure you can build it
   successfully.
#. Move into the dictionary sources directory using this command:
   ``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
#. Identify the current version of SCOWL by checking the file
   ``README_en_US.txt`` (at the beginning of the file there is a line similar to
   ``Generated from SCOWL Version 2020.12.07``, where ``2020.12.07`` is the
   SCOWL version).
#. Download the same version of the dictionary from the `SCOWL`_ homepage or
   `SourceForge <https://sourceforge.net/projects/wordlist/files/SCOWL/>`__ as a
   tarball (``tar.gz``) and unpack it in the working directory. Rename the
   resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
#. Run the ``edit-dictionary.sh`` script to edit the list of words. The script
   only works if the environment variable ``EDITOR`` is set to the executable
   of an editor program. If ``EDITOR`` is already set, type
   ``sh edit-dictionary.sh``; otherwise, use
   ``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim`` (or substitute
   another editor).

   Paste the full list of words into the editor, then save and quit. It’s not
   necessary to put the words in alphabetical order, as the script will sort
   them.

   Note: you might need to install ``aspell`` on your system (e.g. via
   ``brew install aspell`` on macOS).
#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
   sure it runs without errors. For more details on this script, see the
   `make-new-dict.sh`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
   example, make sure that the size is about the same as the original dictionary
   (or slightly larger).
#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
   generated file to the right location.
#. Build Firefox and test your updated dictionary. Once you’re
   satisfied, use the process described in :ref:`write_a_patch` to create a
   patch.
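
The version lookup in step 3 can also be done from the command line. The
following is a minimal sketch and not part of the scripts in the tree: the
function names ``scowl_version`` and ``scowl_folder`` are hypothetical, and the
sketch assumes the README contains a line in the ``Generated from SCOWL Version
YYYY.MM.DD`` format quoted in step 3.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the in-tree scripts.

# scowl_version FILE: print the SCOWL version recorded in FILE.
# Assumes a line containing "Generated from SCOWL Version YYYY.MM.DD".
scowl_version() {
    sed -n 's/.*Generated from SCOWL Version \([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' "$1" |
        head -n 1
}

# scowl_folder FILE: print the folder name the unpacked tarball is expected
# to have (scowl-YYYY.MM.DD), which then needs to be renamed to "scowl".
scowl_folder() {
    printf 'scowl-%s\n' "$(scowl_version "$1")"
}
```

From the dictionary sources directory, ``scowl_version README_en_US.txt`` would
then print the version to download.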

Note that the update script will modify two versions of the dictionary, and
both need to be committed:

* ``en-US.dic``: the dictionary actually shipping in the build; it uses
  ISO-8859-1 encoding.
* ``utf8/en-US.dic``: a version of the same dictionary with UTF-8 encoding. This
  is used to work around issues with Phabricator, and it allows the diff to
  display the actual changes.

Exclude words from suggestions
==============================

It’s possible to completely exclude words from suggested alternatives by adding
an affix rule ``!`` at the end of the definition in the ``.dic`` file. For
example:

* ``bum`` would be changed to ``bum/!`` (note the additional forward slash).
* ``bum/MS`` would be changed to ``bum/MS!``.

To exclude a word from suggestions, follow the instructions available
in `Adding new words to the en-US dictionary`_. Instead of running the
``edit-dictionary.sh`` script (step 5), use a text editor to edit the file
``en-US.dic`` directly, then proceed with the remaining instructions.
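
The two rewriting rules above can be sketched as a small filter. This is only
an illustration of the transformation, not a replacement for checking the file
by hand: the function name ``mark_nosuggest`` is hypothetical, and the sketch
assumes each entry is a plain ``word`` or ``word/FLAGS`` line with no extra
fields. Running it under ``LC_ALL=C`` treats the input as raw bytes, so
ISO-8859-1 content passes through unchanged.

```shell
#!/bin/sh
# Hypothetical helper, not part of the in-tree scripts.

# mark_nosuggest WORD: read .dic lines on stdin and write them to stdout,
# appending the "!" flag to the entry for WORD.
mark_nosuggest() {
    LC_ALL=C awk -v word="$1" '
        # Entry without flags: "bum" becomes "bum/!".
        $0 == word { print $0 "/!"; next }
        # Entry with flags that does not already end in "!":
        # "bum/MS" becomes "bum/MS!".
        index($0, word "/") == 1 && substr($0, length($0)) != "!" {
            print $0 "!"
            next
        }
        # Everything else is passed through untouched.
        { print }
    '
}
```

For example, ``mark_nosuggest bum < en-US.dic > en-US.dic.new`` writes the
modified list to a new file, which can be compared with the original before
replacing it.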

.. warning::

  Make sure to open ``en-US.dic`` with the correct encoding. For example, Visual
  Studio Code will try to open it as ``UTF-8``, and it needs to be reopened with
  encoding ``Western (ISO 8859-1)``.

Upgrading dictionary to a new upstream version of SCOWL
=======================================================

The English dictionary available in mozilla-central is based on the
`SCOWL`_ dictionary. Some scripts distributed with the SCOWL package are
used to generate the files for the en-US dictionary.

The working directory for this process is
``extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.

#. Download the latest version of the dictionary from the `SCOWL`_ homepage or
   `SourceForge <https://sourceforge.net/projects/wordlist/files/SCOWL/>`__ as a
   tarball (``tar.gz``) and unpack it in the working directory. Rename the
   resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
   sure it runs without errors. For more details on this script, see the
   `make-new-dict.sh`_ section.
#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
   example, make sure that the size is about the same as the original dictionary
   (or slightly larger).
#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
   generated file to the right location and use the process described in
   :ref:`write_a_patch` to create a patch.
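
The size check in step 3 can be scripted. The sketch below is one possible way
to do it: the function name ``size_ratio_ok`` is hypothetical, and the default
threshold of 10 percent is an arbitrary value chosen for illustration, not a
limit defined by the project.

```shell
#!/bin/sh
# Hypothetical helper, not part of the in-tree scripts.

# size_ratio_ok OLD NEW [MAX_PERCENT]: succeed if the size of NEW differs from
# the size of OLD by at most MAX_PERCENT percent (default: 10, an assumption).
size_ratio_ok() {
    # The arithmetic expansion strips the padding some wc versions print.
    old_size=$(( $(wc -c < "$1") ))
    new_size=$(( $(wc -c < "$2") ))
    max_percent=${3:-10}
    # Absolute difference in bytes.
    if [ "$new_size" -ge "$old_size" ]; then
        diff_size=$((new_size - old_size))
    else
        diff_size=$((old_size - new_size))
    fi
    echo "old: $old_size bytes, new: $new_size bytes"
    # Integer form of: diff_size / old_size <= max_percent / 100.
    [ $((diff_size * 100)) -le $((old_size * max_percent)) ]
}
```

The function takes the original dictionary as its first argument and
``en_US-mozilla.dic`` as its second. A non-zero exit status is only a hint that
the result needs a closer look.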

Info about the file structure
=============================

mozilla-specific.txt
--------------------

This file contains Mozilla-specific words that should not be submitted
upstream. For example, ``Firefox`` should go in this file (see `bug 237921`_).

Note that the file ``5-mozilla-specific.txt`` is generated by expanding
``mozilla-specific.txt`` and should not be edited directly.

utf8 folder
-----------

``dictionary-sources/utf8`` is used to store a copy with UTF-8 encoding of the
dictionary files. This is used to work around limitations in Phabricator, which
treats ISO-8859-1 files as binary and won’t display a diff when updating them.
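
To inspect an ISO-8859-1 file in a UTF-8 terminal, or to check which encoding a
file is in, ``iconv`` can be used. This is a minimal sketch with hypothetical
function names (``latin1_to_utf8``, ``is_valid_utf8``); it is not necessarily
how the in-tree scripts produce the ``utf8`` copies.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the in-tree scripts.

# latin1_to_utf8 FILE: print an ISO-8859-1 encoded FILE as UTF-8 on stdout.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 "$1"
}

# is_valid_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# A file with accented ISO-8859-1 characters normally fails this check.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example, ``latin1_to_utf8 en-US.dic | less`` shows accented entries
correctly without modifying the file on disk.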

Info about the included scripts
===============================

make-new-dict.sh
----------------

The dictionary upgrade script ``make-new-dict.sh`` works by expanding (i.e.
“unmunching”) the affix-compressed dictionaries to create wordlists, and then
uses those wordlists to generate a new dictionary.

The upgrade script expects the current upstream version to be kept in the
directory ``orig``.

The script will create a few files in ``dictionary-sources/support_files`` in
the following order:

* ``0-special.txt`` contains numbers and ordinals expanded from SCOWL
  ``en.dic.supp``.
* ``1-base.txt`` contains words expanded from ``en_US-custom.dic`` in the
  **previous** version of SCOWL (from the ``orig`` folder).
* ``2-mozilla.txt`` contains words expanded from the current Mozilla dictionary.
* ``3-upstream.txt`` contains words expanded from ``en_US-custom.dic`` in the
  **new** version of SCOWL (from the ``scowl/speller`` folder).
* ``2-mozilla-removed.txt`` contains words that are only available in the SCOWL
  dictionary, i.e. removed by Mozilla.
* ``2-mozilla-added.txt`` contains words that are only available in the current
  Mozilla dictionary, i.e. added by Mozilla.
* ``4-patched.txt`` contains words from the new SCOWL dictionary
  (``3-upstream.txt``), with the words from ``2-mozilla-removed.txt`` removed
  and the words from ``2-mozilla-added.txt`` added.
* ``5-mozilla-specific.txt`` is expanded from ``mozilla-specific.txt`` using the
  current affix rules from the Mozilla dictionary.
* ``5-mozilla-removed.txt`` and ``5-mozilla-added.txt`` contain words that are
  respectively removed and added by Mozilla compared to the **new** SCOWL
  version. These files could be used to submit upstream changes, but words
  included in ``5-mozilla-specific.txt`` should be removed from these lists.

The new dictionary is available as ``en_US-mozilla.dic`` and should be copied
over using the ``install-new-dict.sh`` script.
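
The relationship between the ``1-``, ``2-``, ``3-`` and ``4-`` files is plain
set arithmetic on sorted wordlists. The sketch below illustrates that
arithmetic with ``sort`` and ``comm``. It is an assumption about how such a
step can be written, not a copy of ``make-new-dict.sh``: the function name
``patch_wordlist`` is hypothetical, and the expansion of the affix-compressed
dictionaries and the final compression are left out.

```shell
#!/bin/sh
# Hypothetical helper illustrating the wordlist arithmetic only.

# patch_wordlist BASE MOZILLA UPSTREAM OUTDIR
# BASE, MOZILLA and UPSTREAM are expanded wordlists, one word per line.
patch_wordlist() {
    base=$1
    mozilla=$2
    upstream=$3
    outdir=$4
    # comm needs sorted input; LC_ALL=C keeps sort and comm consistent.
    LC_ALL=C sort -u "$base"     > "$outdir/1-base.txt"
    LC_ALL=C sort -u "$mozilla"  > "$outdir/2-mozilla.txt"
    LC_ALL=C sort -u "$upstream" > "$outdir/3-upstream.txt"
    # Words only in the previous SCOWL list: removed by Mozilla.
    LC_ALL=C comm -23 "$outdir/1-base.txt" "$outdir/2-mozilla.txt" \
        > "$outdir/2-mozilla-removed.txt"
    # Words only in the Mozilla list: added by Mozilla.
    LC_ALL=C comm -13 "$outdir/1-base.txt" "$outdir/2-mozilla.txt" \
        > "$outdir/2-mozilla-added.txt"
    # New upstream list, minus the removed words, plus the added words.
    LC_ALL=C comm -23 "$outdir/3-upstream.txt" "$outdir/2-mozilla-removed.txt" |
        LC_ALL=C sort -u - "$outdir/2-mozilla-added.txt" \
        > "$outdir/4-patched.txt"
}
```

With this approach, a word that Mozilla removed from the previous SCOWL version
stays removed after the upgrade, and a word that Mozilla added is carried over
even if the new SCOWL version still lacks it.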

install-new-dict.sh
-------------------

The script:

* Creates a copy of ``orig`` as ``support_files/orig-bk`` and copies the new
  upstream version to ``orig``.
* Copies the existing Mozilla dictionary to ``support_files/mozilla-bk``.
* Converts the dictionary (.dic) generated by ``make-new-dict.sh`` from UTF-8 to
  ISO-8859-1 and moves it to the parent folder.
* Sets the affix file (.aff) to use ``ISO8859-1`` as ``SET`` instead of the
  original ``UTF-8``, and removes ``ICONV`` patterns (input conversion tables).
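
The last two steps can be pictured with ``iconv`` and ``sed``. This is a sketch
under stated assumptions rather than the contents of ``install-new-dict.sh``:
the function name ``convert_dictionary`` and the output file name ``en-US.aff``
are hypothetical, and re-encoding the affix file is an assumption made here so
that both output files share one encoding. The ``ICONV`` lines are dropped
before re-encoding because such tables can contain characters that have no
ISO-8859-1 representation.

```shell
#!/bin/sh
# Hypothetical helper, not a copy of install-new-dict.sh.

# convert_dictionary SRC_DIC SRC_AFF DEST_DIR
# Writes DEST_DIR/en-US.dic and DEST_DIR/en-US.aff in ISO-8859-1.
convert_dictionary() {
    src_dic=$1
    src_aff=$2
    dest_dir=$3
    # Re-encode the word list from UTF-8 to ISO-8859-1.
    iconv -f UTF-8 -t ISO-8859-1 "$src_dic" > "$dest_dir/en-US.dic" || return 1
    # Switch the SET line, drop the ICONV lines, then re-encode the rest.
    LC_ALL=C sed -e 's/^SET UTF-8$/SET ISO8859-1/' -e '/^ICONV /d' "$src_aff" |
        iconv -f UTF-8 -t ISO-8859-1 > "$dest_dir/en-US.aff"
}
```

If a word cannot be represented in ISO-8859-1, ``iconv`` exits with an error,
which is a useful signal that the new wordlist needs attention before it is
installed.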


.. _SCOWL: http://wordlist.aspell.net
.. _file a new bug: https://bugzilla.mozilla.org/show_bug.cgi?id=enus-dictionary
.. _SourceForce: https://sourceforge.net/projects/wordlist/files/SCOWL/
.. _bug 237921: https://bugzilla.mozilla.org/show_bug.cgi?id=237921