* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
*   Copyright (C) 2004-2016, International Business Machines
*   Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer

* change log for Unicode updates

For an overview, see https://unicode-org.github.io/icu/processes/unicode-update

Notes:

This log includes several command lines as used in the update process.
Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
Use a console window that is set to that directory, or cd to there,
and then paste the command that follows the $ sign.

Most command lines use environment variables to make them more portable across versions
and machine configurations. When you set up a console window, copy & paste the `export` commands
from near the top of the current section before pasting tool command lines.
Adjust the environment variables to the current version and your machine setup.
(The command lines are currently as used on Linux.)

Syntax of this file:

`***` - section heading
`*` - sub heading
`-` - 1st level bullet
`+` - 2nd level bullet
`=` - 1st level bullet
`->` - "the previous things leads to...", OR a 2nd level bullet/item

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.

---------------------------------------------------------------------------- ***

TODO: Run gencolusb for Unicode updates.
- https://github.com/markusicu/icu/blob/main/icu4c/source/tools/gencolusb/README.md
- until ICU-12062 is done

---------------------------------------------------------------------------- ***

Unicode 17.0 update for ICU 78

https://www.unicode.org/versions/Unicode17.0.0/
https://www.unicode.org/versions/beta-17.0.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-35.html

https://unicode-org.atlassian.net/browse/ICU-23038 Unicode 17
https://unicode-org.atlassian.net/browse/CLDR-18283 BRS Unicode 17

* Command-line environment setup

Markus:

export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni17/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt78b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src

Elango:

export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni17/final
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt78b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src

*** Unicode version numbers
- icu4c/source/data/makedata.mak
- icu4c/source/common/unicode/uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
- FYI: The option that adds the additional Unicode data files for ICU4J is
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data
- Markus's version:
  cd $ICU_OUT/icu4c
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ../../src/icu4c/source/runConfigureICU --enable-debug --disable-release Linux/clang --prefix=/usr/local/google/home/mscherer/icu/mine/inst/icu4c > config.out 2>&1 ; tail config.out
- Elango's version (diff default C++ compiler & in-source build paths):
  cd $ICU_OUT/icu4c/source
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ./runConfigureICU --enable-debug --disable-release Linux/gcc --prefix=/usr/local/google/home/elango/oss/icu/icu4c > config.out 2>&1 ; tail config.out

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft
  + subfolders: emoji, idna, security, ucd, uca
  + for pre-release (alpha, beta) data files:
    ~ if one of us produces the alpha.zip or beta.zip collection of data files for publication,
      then we can use its contents directly (no FTP from unicode.org necessary)
    ~ otherwise download from https://www.unicode.org/Public/draft/
    ~ you can omit or discard the charts/ and ucdxml/ files/folders
    ~ you can omit or discard ucd/UCD.zip & ucd/Unihan.zip & security/*.zip
  + alternate way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + for final-release data files, the source of truth is the files in
    https://www.unicode.org/Public/(version)

* process and/or copy files
- cd $ICU_SRC/tools/unicode
  py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
    e.g.
    py/preparseucd.py $UNICODE_DATA --only_ppucd /tmp/ppucd.txt

* new constants for new property values
- preparseucd.py error:
  ValueError: missing uchar.h enum constants for some property values: [('blk', {'Tangut_Components_Sup', 'Misc_Symbols_Sup', 'CJK_Ext_J', 'Tai_Yo', 'Chisoi', 'Sharada_Sup', 'Beria_Erfe', 'Tolong_Siki', 'Sidetic'}), ('jg', {'Thin_Noon'}), ('lb', {'HH'}), ('sc', {'Chis', 'Sidt', 'Tols', 'Berf', 'Tayo'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    (cd $UNIDATA_ROOT && diff -u uni16/final/UCD/ucd/PropertyValueAliases.txt uni17/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]')
        +age; 17.0 ; V17_0
        +blk; Beria_Erfe ; Beria_Erfe
        +blk; Chisoi ; Chisoi
        +blk; CJK_Ext_J ; CJK_Unified_Ideographs_Extension_J
        +blk; Misc_Symbols_Sup ; Miscellaneous_Symbols_Supplement
        +blk; Sharada_Sup ; Sharada_Supplement
        +blk; Sidetic ; Sidetic
        +blk; Tai_Yo ; Tai_Yo
        +blk; Tangut_Components_Sup ; Tangut_Components_Supplement
        +blk; Tolong_Siki ; Tolong_Siki
        +jg ; Thin_Noon ; Thin_Noon
        +lb ; HH ; Unambiguous_Hyphen
        +sc ; Berf ; Beria_Erfe
        +sc ; Chis ; Chisoi
        +sc ; Sidt ; Sidetic
        +sc ; Tayo ; Tai_Yo
        +sc ; Tols ; Tolong_Siki
  + copy new API constants from the preparseucd.py output into the .h/.java files,
    add/adjust comments, wrap lines, and set numeric values
  + (ignore Age: no API constants for that)
  + Block:
    uchar.h before UBLOCK_COUNT,
    UCharacter.UnicodeBlock IDs before COUNT,
    UCharacter.UnicodeBlock objects before INVALID_CODE
  + Script: uscript.h & com.ibm.icu.lang.UScript
  + for new scripts: fix expectedLong names
    in cintltst/cucdapi.c/TestUScriptCodeAPI()
    and in com.ibm.icu.dev.test.lang.TestUScript.java
  + Indic_Syllabic_Category: uchar.h & UCharacter.IndicSyllabicCategory
  + Note: preparseucd.py does not write constants for values of every property.
    Add some manually, or write more generator code.
  + after adding new API constants, run preparseucd.py again

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

  $ICU_OUT/icu4c$ echo;echo; date; make -j20 tests &> out.txt ; tail -n 30 out.txt ; date

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
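
A possible sketch of the Bazel version check described above, for illustration only.
It assumes that `bazelisk --version` prints a line of the form "bazel X.Y.Z".
The helper name bazel_version_from_output is hypothetical (not part of ICU or bazelisk),
and /tmp merely stands in for "any folder outside of $ICU_SRC".
It only prints the two pieces of information side by side; edit .bazeliskrc by hand.

```shell
#!/bin/sh
# Sketch only; bazel_version_from_output is a hypothetical helper.
# Assumption: `bazelisk --version` prints a line like "bazel 8.2.1".

# Read version output on stdin, print just the version number.
bazel_version_from_output() {
    awk '$1 == "bazel" { print $2; exit }'
}

if command -v bazelisk >/dev/null 2>&1; then
    # Run outside of $ICU_SRC so that the version pinned there is not picked up.
    latest=$(cd /tmp && bazelisk --version | bazel_version_from_output)
    echo "latest bazel version: $latest"
    # Show the current config file for a manual comparison.
    cat "$ICU_SRC/.bazeliskrc" 2>/dev/null || echo "(no .bazeliskrc found under \$ICU_SRC)"
fi
```

Printing instead of rewriting the file keeps the step easy to revert if the newer version turns out to be incompatible.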

* TODO:
  Error when upgrading from Bazel 7.2.1 to Bazel 8.2.1:
  ERROR: Skipping '//icu4c/source/tools/gennorm2': error loading package 'icu4c/source/tools/gennorm2': Unable to find package for @@[unknown repo 'rules_cc' requested from @@]//cc:defs.bzl: The repository '@@[unknown repo 'rules_cc' requested from @@]' could not be resolved: No repository visible as '@rules_cc' from main repository. Was the repository introduced in WORKSPACE? The WORKSPACE file is disabled by default in Bazel 8 (late 2024) and will be removed in Bazel 9 (late 2025), please migrate to Bzlmod. See https://bazel.build/external/migration.
  Need to revisit!

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
  bazelisk clean
  or even
  bazelisk clean --expunge
- build/bootstrap/generate new files:
  icu4c/source/data/unidata/generate.sh

* NOTE: propsVectorsTrie_index in uprops.icu / uchar_props_data.h
  (and a bit of propsVectors)
  increased by some 22kB, probably mostly due to the revised Identifier_Type data,
  especially for Unified_Ideograph characters.

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- Some properties are hardcoded in the ICU libraries because they apply to
  few characters or ranges, and are not expected to change often.
  They are tested at least in C++ intltest (e.g., against ppucd.txt).
  If these tests fail, then update the implementation and the tests.
- Robin or Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt; adjust boundaries as needed, e.g., for new currency symbols
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/collate/src/test/resources/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc.
  from
  $ICU_SRC/icu4c/source/i18n/collationfcd.cpp
  to
  $ICU_SRC/icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationFCD.java
- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  Follow the tools/cldr/cldr-to-icu/README.md file.
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
    nor with the Google default of /usr/local/buildtools/java/jdk
    [Google security limitations in the XML parser])
    export PATH=$JAVA_HOME/bin:$PATH
    export TOOLS_ROOT=$ICU_SRC/tools
    export ICU_DIR=$ICU_SRC
    export CLDR_DIR=$CLDR_SRC
    export CLDR_DATA_DIR=$CLDR_DIR
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
  + build & run Java code
    Follow the instructions in the TLDR section of the cldr-to-icu/README.md file.
    In TLDR "Run the conversion tool" add parameters to generate only the files we need:
    java -jar target/cldr-to-icu-1.0-SNAPSHOT-jar-with-dependencies.jar --outDir=/tmp/icu --outputTypes=coll,transforms --localeIdFilter='zh.*' --dontGenCode
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudata
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudata
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt78l.dat ./out/icu4j/icudt78b.dat -s ./out/build/icudt78l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudata
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudata/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudata/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudata/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudata/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudata"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudata/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudata/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
  cd $ICU_OUT/icu4c/data/out/icu4j
  cp -v com/ibm/icu/impl/data/icudata/coll/* $ICU_SRC/icu4j/main/collate/src/main/resources/com/ibm/icu/impl/data/icudata/coll
  cp -v com/ibm/icu/impl/data/icudata/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr
  cp -v com/ibm/icu/impl/data/icudata/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
  cp -v com/ibm/icu/impl/data/icudata/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
  cd com/ibm/icu/impl/data/icudata/
  ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata";}' | sh
- The procedure above is very conservative:
  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
  It avoids dealing with any other discrepancies
  between the source and generated data files.
  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
  $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
  cd $ICU_SRC/icu4c/source/data/unidata
  cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
  cd ../../test/testdata
  cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
  cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
  ~/unitools/mine/src$ diff -u unicodetools/data/ucd/16.0.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
  -->
  +11DE0..11DE9 ; Nd # [10] TOLONG SIKI DIGIT ZERO..TOLONG SIKI DIGIT NINE
  +16DA0..16DA9 ; Nd # [10] CHISOI DIGIT ZERO..CHISOI DIGIT NINE
  --> https://github.com/unicode-org/cldr/pull/4726
  (FYI: Chisoi was later removed from Unicode 17)

*** merge the Unicode update branch back onto the main branch
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 16.0 update for ICU 76

https://www.unicode.org/versions/Unicode16.0.0/
https://www.unicode.org/versions/beta-16.0.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-33.html

https://unicode-org.atlassian.net/browse/ICU-22707 Unicode 16
https://unicode-org.atlassian.net/browse/CLDR-17226 BRS Unicode 16

https://github.com/unicode-org/unicodetools/pull/774 delete the RecommendedSetGenerator

https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1

* Command-line environment setup

Markus:

export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt76b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src

Elango:

export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt76b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src

*** Unicode version numbers
- icu4c/source/data/makedata.mak
- icu4c/source/common/unicode/uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
- FYI: The option that adds the additional Unicode data files for ICU4J is
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data
- Markus's version:
  cd $ICU_OUT/icu4c
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ../../src/icu4c/source/runConfigureICU --enable-debug --disable-release Linux/clang --prefix=/usr/local/google/home/mscherer/icu/mine/inst/icu4c > config.out 2>&1 ; tail config.out
- Elango's version (diff default C++ compiler & in-source build paths):
  cd $ICU_OUT/icu4c/source
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ./runConfigureICU --enable-debug --disable-release Linux/gcc --prefix=/usr/local/google/home/elango/oss/icu/icu4c > config.out 2>&1 ; tail config.out

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
  + subfolders: emoji, idna, security, ucd, uca
  + for pre-release (alpha, beta) data files:
    ~ if one of us produces the alpha.zip or beta.zip collection of data files for publication,
      then we can use its contents directly (no FTP from unicode.org necessary)
    ~ otherwise download from https://www.unicode.org/Public/draft/
    ~ you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders
    ~ you can omit or discard UCD/ucd/Unihan.zip
  + alternate way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + for final-release data files, the source of truth is the files in
    https://www.unicode.org/Public/(version) [=UCD],
    https://www.unicode.org/Public/UCA/(version),
    https://www.unicode.org/Public/idna/(version),
    etc.
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
  or from the UCD/cldr/ output folder of the Unicode Tools:
  From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
  CLDR used modified grapheme break rules.
  This might happen again.
  + To check in the Unicode Tools workspace:
    ~/unitools/mine/Generated$ meld UCD/16.0.0/auxiliary/*GraphemeBreakTest.txt UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt
  + If different, and after copying into CLDR:
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  + We may need CLDR versions of WordBreakTest.txt and LineBreakTest.txt
    unless Unicode 16 and CLDR 46 eliminate their differences:
    unicodetools issue #492

* process and/or copy files
- cd $ICU_SRC/tools/unicode
  py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
    e.g.
    py/preparseucd.py $UNICODE_DATA --only_ppucd /tmp/ppucd.txt

* new constants for new property values
- preparseucd.py error:
  ValueError: missing uchar.h enum constants for some property values:
  [('blk', {'Garay', 'Tulu_Tigalari', 'Todhri', 'Sunuwar', 'Egyptian_Hieroglyphs_Ext_A', 'Kirat_Rai', 'Symbols_For_Legacy_Computing_Sup', 'Myanmar_Ext_C', 'Ol_Onal', 'Gurung_Khema'}),
  ('sc', {'Gara', 'Onao', 'Todr', 'Krai', 'Tutg', 'Sunu', 'Gukh'}),
  ('InSC', {'Reordering_Killer'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    (cd $UNIDATA_ROOT && diff -u uni15.1/final/ucd/PropertyValueAliases.txt uni16/alpha/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]')
        +age; 16.0 ; V16_0
        +blk; Egyptian_Hieroglyphs_Ext_A ; Egyptian_Hieroglyphs_Extended_A
        +blk; Garay ; Garay
        +blk; Gurung_Khema ; Gurung_Khema
        +blk; Kirat_Rai ; Kirat_Rai
        +blk; Myanmar_Ext_C ; Myanmar_Extended_C
        +blk; Ol_Onal ; Ol_Onal
        +blk; Sunuwar ; Sunuwar
        +blk; Symbols_For_Legacy_Computing_Sup ; Symbols_For_Legacy_Computing_Supplement
        +blk; Todhri ; Todhri
        +blk; Tulu_Tigalari ; Tulu_Tigalari
        +InSC; Reordering_Killer ; Reordering_Killer
        -jg ; Teh_Marbuta_Goal ; Hamza_On_Heh_Goal
        +jg ; Teh_Marbuta_Goal ; Teh_Marbuta_Goal ; Hamza_On_Heh_Goal
        +sc ; Gara ; Garay
        +sc ; Gukh ; Gurung_Khema
        +sc ; Krai ; Kirat_Rai
        +sc ; Onao ; Ol_Onal
        +sc ; Sunu ; Sunuwar
        +sc ; Todr ; Todhri
        +sc ; Tutg ; Tulu_Tigalari
  + copy new API constants from the preparseucd.py output into the .h/.java files,
    add/adjust comments, wrap lines, and set numeric values
  + (ignore Age: no API constants for that)
  + Block: uchar.h before UBLOCK_COUNT,
    UCharacter.UnicodeBlock IDs, UCharacter.UnicodeBlock objects
  + Script: uscript.h & com.ibm.icu.lang.UScript
  + for new scripts: fix expectedLong names
    in cintltst/cucdapi.c/TestUScriptCodeAPI()
    and in com.ibm.icu.dev.test.lang.TestUScript.java
  + Indic_Syllabic_Category: uchar.h & UCharacter.IndicSyllabicCategory
  + after adding new API constants, run preparseucd.py again

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

  $ICU_OUT/icu4c$ echo;echo; date; make -j20 tests &> out.txt ; tail -n 30 out.txt ; date

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
  bazelisk clean
  or even
  bazelisk clean --expunge
- build/bootstrap/generate new files:
  icu4c/source/data/unidata/generate.sh

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- Some properties are hardcoded in the ICU libraries because they apply to
  few characters or ranges, and are not expected to change often.
  They are tested at least in C++ intltest (e.g., against ppucd.txt).
  If these tests fail, then update the implementation and the tests.
- update CLDR GraphemeBreakTest.txt
  (see the download section above about this file)
  cd ~/unitools/mine/Generated
  cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
  cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/collate/src/test/resources/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc.
  from
  $ICU_SRC/icu4c/source/i18n/collationfcd.cpp
  to
  $ICU_SRC/icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationFCD.java
- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  WARNING: outdated, don't do this, follow the tools/cldr/cldr-to-icu/README.md file!
  --- Old text from here:
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
    nor with the Google default of /usr/local/buildtools/java/jdk
    [Google security limitations in the XML parser])
    export TOOLS_ROOT=$ICU_SRC/tools
    export CLDR_DIR=$CLDR_SRC
    export CLDR_DATA_DIR=$CLDR_DIR
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt76b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt76l.dat ./out/icu4j/icudt76b.dat -s ./out/build/icudt76l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt76b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt76b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt76b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
    cd $ICU_OUT/icu4c/data/out/icu4j
    cp -v com/ibm/icu/impl/data/icudata/coll/* $ICU_SRC/icu4j/main/collate/src/main/resources/com/ibm/icu/impl/data/icudata/coll
    cp -v com/ibm/icu/impl/data/icudata/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr
    cp -v com/ibm/icu/impl/data/icudata/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
    cp -v com/ibm/icu/impl/data/icudata/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
    cd com/ibm/icu/impl/data/icudata/
    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata";}' | sh
- The procedure above is very conservative:
  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
  It avoids dealing with any other discrepancies
  between the source and generated data files.
  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
    $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.1.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    +10D40..10D49 ; Nd # [10] GARAY DIGIT ZERO..GARAY DIGIT NINE
    +116D0..116E3 ; Nd # [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE
    +11BF0..11BF9 ; Nd # [10] SUNUWAR DIGIT ZERO..SUNUWAR DIGIT NINE
    +16130..16139 ; Nd # [10] GURUNG KHEMA DIGIT ZERO..GURUNG KHEMA DIGIT NINE
    +16D70..16D79 ; Nd # [10] KIRAT RAI DIGIT ZERO..KIRAT RAI DIGIT NINE
    +1CCF0..1CCF9 ; Nd # [10] OUTLINED DIGIT ZERO..OUTLINED DIGIT NINE
    +1E5F1..1E5FA ; Nd # [10] OL ONAL DIGIT ZERO..OL ONAL DIGIT NINE
    --> https://github.com/unicode-org/cldr/pull/3658

*** merge the Unicode update branch back onto the main branch
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 15.1 update for ICU 74

https://www.unicode.org/versions/Unicode15.1.0/
https://www.unicode.org/versions/beta-15.1.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-31.html

https://unicode-org.atlassian.net/browse/ICU-22404 Unicode 15.1
https://unicode-org.atlassian.net/browse/CLDR-16669 BRS Unicode 15.1

https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1

* Command-line environment setup

Markus:

    export UNIDATA_ROOT=~/unidata
    export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/final
    export CLDR_SRC=~/cldr/uni/src
    export ICU_ROOT=~/icu/uni
    export ICU_SRC=$ICU_ROOT/src
    export ICU_OUT=$ICU_ROOT/dbg
    export ICUDT=icudt74b
    export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
    export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
    export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
    export UNICODE_TOOLS=~/unitools/mine/src

Elango:

    export UNIDATA_ROOT=~/oss/unidata
    export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/snapshot
    export CLDR_SRC=~/oss/cldr/mine/src
    export ICU_ROOT=~/oss/icu
    export ICU_SRC=$ICU_ROOT
    export ICU_OUT=$ICU_ROOT
    export ICUDT=icudt74b
    export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
    export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
    export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
    export UNICODE_TOOLS=~/oss/unicodetools/mine/src

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_OUT/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + new since Unicode 15.1:
    for the pre-release (alpha, beta) data files,
    download all of https://www.unicode.org/Public/draft/
    (you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders)
  + if one of us produces the alpha.zip or beta.zip collection of data files for publication,
    then we can use its contents directly (no FTP from unicode.org necessary)
  + for final-release data files, the sources of truth are the files in
    https://www.unicode.org/Public/(version) [=UCD],
    https://www.unicode.org/Public/UCA/(version),
    https://www.unicode.org/Public/idna/(version),
    etc.
  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
  + subfolders: emoji, idna, security, ucd, uca
  + whichever way you download the files:
    ~ inside ucd: extract Unihan.zip to "here" (.../UCD/ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
        ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/UCD/ucd/Unihan
    ~ FYI: for updating ICU, we do not actually need Unihan.zip contents
  + alternate way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
  or from the UCD/cldr/ output folder of the Unicode Tools:
  From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
  CLDR used modified grapheme break rules.
  This might happen again.
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
  or
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  + Done: figure out whether we need a CLDR version of LineBreakTest.txt:
    unicodetools issue #492
    We should have had one, and instead rbbitst.cpp has a "known issue" exception.
    Unicode 16 and CLDR 46 might get back to having the same behavior.
- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
  + done in ICU 76: modify preparseucd.py to copy this file

* Note: Since Unicode 15.1, data files are no longer published with version suffixes
  even during the alpha or beta.
  Thus we no longer need steps & tools to remove those suffixes.
  (remove this note next time)

* process and/or copy files
- cd $ICU_SRC/tools/unicode
  py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values: [('blk', {'CJK_Ext_I'}), ('lb', {'VF', 'VI', 'AS', 'AK', 'AP'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    cd $UNIDATA_ROOT
    $ diff -u uni15.0/ucd/PropertyValueAliases.txt uni15.1/snapshot/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 15.1 ; V15_1
    +blk; CJK_Ext_I ; CJK_Unified_Ideographs_Extension_I
    +IDSU; N ; No ; F ; False
    +IDSU; Y ; Yes ; T ; True
    +ID_Compat_Math_Continue; N ; No ; F ; False
    +ID_Compat_Math_Continue; Y ; Yes ; T ; True
    +ID_Compat_Math_Start; N ; No ; F ; False
    +ID_Compat_Math_Start; Y ; Yes ; T ; True
    +lb ; AK ; Aksara
    +lb ; AP ; Aksara_Prebase
    +lb ; AS ; Aksara_Start
    +lb ; VF ; Virama_Final
    +lb ; VI ; Virama
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
    cd $UNIDATA_ROOT
    $ diff -u uni15.0/ucd/Blocks.txt uni15.1/snapshot/UCD/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
    +2EBF0..2EE4F; CJK Unified Ideographs Extension I
    (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new line break values to uchar.h & UCharacter.LineBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_OUT/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
    cd $UNICODE_TOOLS
    mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.tools.RecommendedSetGenerator" \
      -Dexec.args="" -am -pl unicodetools -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
- copy & paste from the Console output into the .cpp & .java files

* check hardcoded IDS_Unary_Operator
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that it has not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt IDS_Unary_Operator)
- if it has changed, then update the implementation and the tests
- Since ICU 75, this property is tested in C++ intltest against ppucd.txt.

* check hardcoded ID_Compat_Math_Start & ID_Compat_Math_Continue
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that they have not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt ID_Compat_Math)
- if they have changed, then update the implementation and the tests
- Since ICU 75, these properties are tested in C++ intltest against ppucd.txt.

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional:
    bazelisk clean
  or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* Since Unicode 15.1, the UTS #46 data derivation no longer looks at the decompositions (NFD).
  These characters are now just valid, no longer disallowed_STD3_valid.
  Remove special handling of U+2260, U+226E, U+226F (isNonASCIIDisallowedSTD3Valid())
  from uts46.cpp & UTS46.java,
  and special test code from uts46test.cpp & UTS46Test.java.
  (remove this section next time)

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
        export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
        (didn't work without setting JAVA_HOME,
        nor with the Google default of /usr/local/buildtools/java/jdk
        [Google security limitations in the XML parser])
        export TOOLS_ROOT=$ICU_SRC/tools
        export CLDR_DIR=$CLDR_SRC
        export CLDR_DATA_DIR=$CLDR_DIR
        (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
        cd "$TOOLS_ROOT/cldr/lib"
        ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
        cd "$TOOLS_ROOT/cldr/cldr-to-icu"
        ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
        cd $ICU_SRC
        meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
        meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
        cd $ICU_SRC
        cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
        cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt74l.dat ./out/icu4j/icudt74b.dat -s ./out/build/icudt74l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
    cd $ICU_OUT/icu4c/data/out/icu4j
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cd com/ibm/icu/impl/data/$ICUDT/
    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT";}' | sh
- The procedure above is very conservative:
  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
  It avoids dealing with any other discrepancies
  between the source and generated data files.
  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
    $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.1.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-15.txt /tmp/icu/nv4-15.1.txt
    -->
    (empty this time)
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.0.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    (empty this time)
  Unicode 15.1:
    (none this time)

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

CLDR 43 root collation update for ICU 73

Partial update only for the root collation.
See
- https://unicode-org.atlassian.net/browse/CLDR-15946
  Treat quote marks as equivalent when strength=UCOL_PRIMARY
- https://github.com/unicode-org/cldr/pull/2691
  CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
- https://github.com/unicode-org/cldr/pull/2833
  CLDR-15946 make fancy quotes secondary-different from each other

The related changes to tailorings were already integrated in an earlier PR for
https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.

This update is for the root collation,
which is handled by different tools than the locale data updates.

* Command-line environment setup

    export UNICODE_DATA=~/unidata/uni15/20220830
    export CLDR_SRC=~/cldr/uni/src
    export ICU_ROOT=~/icu/uni
    export ICU_SRC=$ICU_ROOT/src
    export ICUDT=icudt73b
    export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
    export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
    export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Configure: Build Unicode data for ICU4J
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
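
The version-copying step above could be scripted roughly as follows.
This is only a sketch under assumptions: update_bazeliskrc is a hypothetical helper
(not part of ICU or bazelisk), `bazelisk --version` is assumed to print a line of the
form "bazel X.Y.Z", and .bazeliskrc is assumed to hold the version in a
USE_BAZEL_VERSION=... line. Check both against your setup before relying on it.

```shell
# Hypothetical helper: record a Bazel version in a .bazeliskrc-style file.
# Usage: update_bazeliskrc <rc-file> <version, or a "bazel X.Y.Z" line>
update_bazeliskrc() {
    rc_file=$1
    # Accept "bazel 7.4.1" (assumed --version output format) or a bare "7.4.1".
    version=${2#bazel }
    if [ -f "$rc_file" ] && grep -q '^USE_BAZEL_VERSION=' "$rc_file"; then
        # Replace the existing setting and keep all other lines unchanged.
        sed "s/^USE_BAZEL_VERSION=.*/USE_BAZEL_VERSION=$version/" "$rc_file" > "$rc_file.tmp" &&
            mv "$rc_file.tmp" "$rc_file"
    else
        # No setting yet (or no file yet): append one.
        printf 'USE_BAZEL_VERSION=%s\n' "$version" >> "$rc_file"
    fi
}

# Example, querying the version outside of $ICU_SRC as suggested above:
#   latest=$(cd /tmp && bazelisk --version)
#   update_bazeliskrc "$ICU_SRC/.bazeliskrc" "$latest"
```

Review the resulting diff of .bazeliskrc before committing, and revert it if the build turns out to be incompatible.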

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional:
    bazelisk clean
  or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- new for ICU 73: also copy the binary data files directly into the ICU4J tree
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
  or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 15.0 update for ICU 72

https://www.unicode.org/versions/Unicode15.0.0/
https://www.unicode.org/versions/beta-15.0.0.html
https://www.unicode.org/Public/15.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-29.html

https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt72b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
  cd $ICU_ROOT/dbg/icu4c
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + old way of fetching files: from the "Public" area on unicode.org
    ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + new way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
      cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
  ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
  ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 15.0 ; V15_0
    +blk; Arabic_Ext_C ; Arabic_Extended_C
    +blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H
    +blk; Cyrillic_Ext_D ; Cyrillic_Extended_D
    +blk; Devanagari_Ext_A ; Devanagari_Extended_A
    +blk; Kaktovik_Numerals ; Kaktovik_Numerals
    +blk; Kawi ; Kawi
    +blk; Nag_Mundari ; Nag_Mundari
    +sc ; Kawi ; Kawi
    +sc ; Nagm ; Nag_Mundari
  -> add new blocks to uchar.h before UBLOCK_COUNT
    use long property names for enum constants,
    for the trailing comment get the block start code point: diff old & new Blocks.txt
      ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
      +10EC0..10EFF; Arabic Extended-C
      +11B00..11B5F; Devanagari Extended-A
      +11F00..11F5F; Kawi
      -13430..1343F; Egyptian Hieroglyph Format Controls
      +13430..1345F; Egyptian Hieroglyph Format Controls
      +1D2C0..1D2DF; Kaktovik Numerals
      +1E030..1E08F; Cyrillic Extended-D
      +1E4D0..1E4FF; Nag Mundari
      +31350..323AF; CJK Unified Ideographs Extension H
      (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
            replace  public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
    Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
            replace  public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
    and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
  inclusionPat & recommendedPat in i18n/uspoof.cpp
  INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
  bazelisk clean
- build/bootstrap/generate new files:
  icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
  ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
- Unicode 6.0..15.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
  (no rule changes in Unicode 15)
- update CLDR GraphemeBreakTest.txt
  cd ~/unitools/mine/Generated
  cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
  cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
  ICU4C/source/i18n/collationfcd.cpp to
  ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
    nor with the Google default of /usr/local/buildtools/java/jdk
    [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
  ...
  make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
  mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
  mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
  LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
  mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
  jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
  mkdir -p /tmp/icu4j/main/shared/data
  cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
  jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
  mkdir -p /tmp/icu4j/main/shared/data
  cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
  make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
  cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
  cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
  cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
  rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
  cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
  cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
  cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
  jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
  cd $ICU_SRC/icu4c/source/data/unidata
  cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
  cd ../../test/testdata
  cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
  cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
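
  One possible way to collect the new enum constants for that notice is to compare
  the old and new text of a header such as uchar.h or uscript.h.
  This is only a sketch, not part of the ICU tools: the function names are made up
  for illustration, it assumes that the new constants use the UBLOCK_ or USCRIPT_ prefix,
  and the sample input in the usage part uses placeholder values, not real enum values.

```python
import re

# Assumption: the constants of interest use the UBLOCK_ or USCRIPT_ prefix.
# Other enums (for example joining groups) would need additional prefixes.
_CONSTANT = re.compile(r"\b(?:UBLOCK|USCRIPT)_[A-Z0-9_]+\b")


def enum_constants(header_text):
    """Return the set of UBLOCK_*/USCRIPT_* names that occur in header_text."""
    return {match.group(0) for match in _CONSTANT.finditer(header_text)}


def new_constants(old_header, new_header):
    """Return the names found in new_header but not in old_header, sorted."""
    return sorted(enum_constants(new_header) - enum_constants(old_header))


if __name__ == "__main__":
    # Placeholder header snippets; the numeric values are NOT the real ones.
    old = "UBLOCK_VITHKUQI = 1,\nUBLOCK_COUNT = 2,\n"
    new = "UBLOCK_VITHKUQI = 1,\nUBLOCK_KAWI = 2,\nUBLOCK_COUNT = 3,\n"
    for name in new_constants(old, new):
        print(name)
```

  In practice the two inputs could be read from the main branch and the update branch
  copies of the header; the result is a starting point for the notice, to be checked by hand.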

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
    -->
    +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
    +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    +11F50..11F59 ; Nd # [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
    +1E4F0..1E4F9 ; Nd # [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
  Unicode 15:
    kawi 11F50..11F59 Kawi
    nagm 1E4F0..1E4F9 Nag Mundari
    https://github.com/unicode-org/cldr/pull/2041

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 14.0 update for ICU 70

https://www.unicode.org/versions/Unicode14.0.0/
https://www.unicode.org/versions/beta-14.0.0.html
https://www.unicode.org/Public/14.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-27.html

https://unicode-org.atlassian.net/browse/CLDR-14801
https://unicode-org.atlassian.net/browse/ICU-21635

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni14/20210903
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt70b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
  cd $ICU_ROOT/dbg/icu4c
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
      cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
  ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
  ValueError: missing uchar.h enum constants for some property values:
  [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
  (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
  (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 14.0 ; V14_0
    +blk; Arabic_Ext_B ; Arabic_Extended_B
    +blk; Cypro_Minoan ; Cypro_Minoan
    +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B
    +blk; Kana_Ext_B ; Kana_Extended_B
    +blk; Latin_Ext_F ; Latin_Extended_F
    +blk; Latin_Ext_G ; Latin_Extended_G
    +blk; Old_Uyghur ; Old_Uyghur
    +blk; Tangsa ; Tangsa
    +blk; Toto ; Toto
    +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
    +blk; Vithkuqi ; Vithkuqi
    +blk; Znamenny_Music ; Znamenny_Musical_Notation
    +jg ; Thin_Yeh ; Thin_Yeh
    +jg ; Vertical_Tail ; Vertical_Tail
    +sc ; Cpmn ; Cypro_Minoan
    +sc ; Ougr ; Old_Uyghur
    +sc ; Tnsa ; Tangsa
    +sc ; Toto ; Toto
    +sc ; Vith ; Vithkuqi
  -> add new blocks to uchar.h before UBLOCK_COUNT
    use long property names for enum constants,
    for the trailing comment get the block start code point: diff old & new Blocks.txt
      ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
      +0870..089F; Arabic Extended-B
      +10570..105BF; Vithkuqi
      +10780..107BF; Latin Extended-F
      +10F70..10FAF; Old Uyghur
      -11700..1173F; Ahom
      +11700..1174F; Ahom
      +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
      +12F90..12FFF; Cypro-Minoan
      +16A70..16ACF; Tangsa
      -18D00..18D8F; Tangut Supplement
      +18D00..18D7F; Tangut Supplement
      +1AFF0..1AFFF; Kana Extended-B
      +1CF00..1CFCF; Znamenny Musical Notation
      +1DF00..1DFFF; Latin Extended-G
      +1E290..1E2BF; Toto
      +1E7E0..1E7FF; Ethiopic Extended-B
      (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
            replace  public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
    Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
            replace  public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
    and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> add new joining groups to uchar.h & UCharacter.JoiningGroup

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
  inclusionPat & recommendedPat in i18n/uspoof.cpp
  INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
  bazelisk clean
- build/bootstrap/generate new files:
  icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..14.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
  cd ~/unitools/mine/Generated
  cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
  cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
  ICU4C/source/i18n/collationfcd.cpp to
  ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
    nor with the Google default of /usr/local/buildtools/java/jdk
    [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
  ...
  make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
  mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b
  mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b
  LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b
  mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b"
  jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/
  mkdir -p /tmp/icu4j/main/shared/data
  cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
  jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/
  mkdir -p /tmp/icu4j/main/shared/data
  cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
  make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
  or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
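
Optional check for the "refresh Java test .txt files" step above, before running the ICU4J tests.
This is only a minimal sketch and not part of the ICU tools: the function name check_copies
is hypothetical, and it assumes nothing beyond a POSIX shell and cmp.
It lists each named file whose ICU4J copy is missing or differs from the ICU4C copy.

```shell
# Hypothetical helper (not part of ICU): compare the ICU4C and ICU4J copies
# of the named files. Prints one line per file whose copy is missing or
# differs, and returns non-zero if any such file was found.
#   $1 = directory with the ICU4C files, $2 = directory with the ICU4J copies,
#   remaining arguments = file names to compare
check_copies() {
    src_dir=$1
    dest_dir=$2
    shift 2
    status=0
    for name in "$@"; do
        # cmp -s only sets the exit status; a missing file also counts as a mismatch.
        if ! cmp -s "$src_dir/$name" "$dest_dir/$name" 2>/dev/null; then
            echo "stale or missing: $name"
            status=1
        fi
    done
    return $status
}
```

Possible use, with the environment variables from the top of this section:
    check_copies $ICU4C_UNIDATA $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode confusables.txt NormalizationTest.txt UnicodeData.txt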

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
    -->
    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  Unicode 14:
    tnsa 16AC0..16AC9 Tangsa
    https://github.com/unicode-org/cldr/pull/1326

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
                     u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
       (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
       (u'InPC', set([u'Top_And_Bottom_And_Left']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chorasmian ; Chorasmian
    blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G
    blk; Dives_Akuru ; Dives_Akuru
    blk; Khitan_Small_Script ; Khitan_Small_Script
    blk; Lisu_Sup ; Lisu_Supplement
    blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
    blk; Tangut_Sup ; Tangut_Supplement
    blk; Yezidi ; Yezidi
  -> add to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find    UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find    UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Chrs ; Chorasmian
    sc ; Diak ; Dives_Akuru
    sc ; Kits ; Khitan_Small_Script
    sc ; Yezi ; Yezidi
  -> uscript.h & com.ibm.icu.lang.UScript
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

    InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left
  -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU (make install)
  to make sure that there are no syntax errors, and
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- tool failure:
    genprops: Script_Extensions indexes overflow bit field
    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format :
     add two more bits to store a script code or Script_Extensions index
  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..13.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca
    $ICU_ROOT/dbg/tools/unicode/c$
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
    -DUVERSION=13.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
  or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
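
The staging commands in the "update Java data files" step above (copy the Unicode-derived
files to a separate location before running jar uvf) could be wrapped in one function.
This is only a sketch under assumptions: the name stage_unicode_data and its two
parameters are hypothetical and not part of the ICU build; the file patterns are the
ones used in the commands above.

```shell
# Hypothetical helper (not part of ICU): copy only the Unicode-derived data
# files from the icu4j-data-install output into a staging directory,
# mirroring the cp/rm commands of the "update Java data files" step.
#   $1 = source data directory, e.g. .../out/icu4j/com/ibm/icu/impl/data/$ICUDT
#   $2 = staging data directory, e.g. /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
stage_unicode_data() {
    from=$1
    to=$2
    mkdir -p "$to/coll" "$to/brkitr" || return 1
    cp "$from/confusables.cfu" "$to" || return 1
    cp "$from"/*.icu "$to" || return 1
    # cnvalias.icu is the converter alias table, not Unicode data;
    # the documented steps remove it again after the *.icu copy.
    rm -f "$to/cnvalias.icu"
    cp "$from"/*.nrm "$to" || return 1
    cp "$from"/coll/* "$to/coll" || return 1
    cp "$from"/brkitr/* "$to/brkitr" || return 1
}
```

Possible use, followed by the same jar uvf command as above:
    stage_unicode_data $ICU_ROOT/dbg/icu4c/data/out/icu4j/com/ibm/icu/impl/data/$ICUDT /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT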

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.1: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.1.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
  or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
2478 2479 *** CLDR numbering systems 2480 - look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR 2481 for example, look for 2482 ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt 2483 in new blocks (Blocks.txt) 2484 Unicode 12: using Unicode 12 CLDR ticket #11478 2485 hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong 2486 wcho 1E2F0..1E2F9 Wancho 2487 Unicode 11: using Unicode 11 CLDR ticket #10978 2488 rohg 10D30..10D39 Hanifi_Rohingya 2489 gong 11DA0..11DA9 Gunjala_Gondi 2490 Earlier: CLDR tickets specific to adding new numbering systems. 2491 Unicode 10: http://unicode.org/cldr/trac/ticket/10219 2492 Unicode 9: http://unicode.org/cldr/trac/ticket/9692 2493 2494 *** merge the Unicode update branches back onto the trunk 2495 - do not merge the icudata.jar and testdata.jar, 2496 instead rebuild them from merged & tested ICU4C 2497 - make sure that changes to Unicode tools are checked in: 2498 http://www.unicode.org/utility/trac/log/trunk/unicodetools 2499 2500 ---------------------------------------------------------------------------- *** 2501 2502 Unicode 12.0 update for ICU 64 2503 2504 http://www.unicode.org/versions/Unicode12.0.0/ 2505 http://unicode.org/versions/beta-12.0.0.html 2506 https://www.unicode.org/review/pri389/ 2507 http://www.unicode.org/reports/uax-proposed-updates.html 2508 http://www.unicode.org/reports/tr44/tr44-23.html 2509 2510 ICU-20203 Unicode 12 2511 2512 ICU-20111 move text layout properties data into a data file 2513 2514 cldrbug 11478: Unicode 12 2515 Accidentally used ^/trunk instead of ^/branches/markus/uni12 2516 2517 * Command-line environment setup 2518 2519 UNICODE_DATA=~/unidata/uni12/20190309 2520 CLDR_SRC=~/svn.cldr/uni 2521 ICU_ROOT=~/icu/uni 2522 ICU_SRC=$ICU_ROOT/src 2523 ICUDT=icudt63b 2524 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 2525 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 2526 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 2527 2528 *** Unicode version numbers 2529 - makedata.mak 
2530 - uchar.h 2531 - com.ibm.icu.util.VersionInfo 2532 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 2533 2534 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 2535 so that the makefiles see the new version number. 2536 2537 *** data files & enums & parser code 2538 2539 * download files 2540 - mkdir -p $UNICODE_DATA 2541 - download Unicode files into $UNICODE_DATA 2542 + subfolders: emoji, idna, security, ucd, uca 2543 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 2544 2545 * for manual diffs and for Unicode Tools input data updates: 2546 remove version suffixes from the file names 2547 ~$ unidata/desuffixucd.py $UNICODE_DATA 2548 (see https://sites.google.com/site/unicodetools/inputdata) 2549 2550 * process and/or copy files 2551 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 2552 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 2553 + For debugging, and tweaking how ppucd.txt is written, 2554 the tool has an --only_ppucd option: 2555 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 2556 2557 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 2558 2559 * build ICU (make install) 2560 so that the tools build can pick up the new definitions from the installed header files. 
2561 2562 $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 2563 2564 * new constants for new property values 2565 - preparseucd.py error: 2566 ValueError: missing uchar.h enum constants for some property values: 2567 [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic', 2568 u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong', 2569 u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])), 2570 (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))] 2571 = PropertyValueAliases.txt new property values (diff old & new .txt files) 2572 blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls 2573 blk; Elymaic ; Elymaic 2574 blk; Nandinagari ; Nandinagari 2575 blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong 2576 blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers 2577 blk; Small_Kana_Ext ; Small_Kana_Extension 2578 blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A 2579 blk; Tamil_Sup ; Tamil_Supplement 2580 blk; Wancho ; Wancho 2581 -> add to uchar.h 2582 use long property names for enum constants, 2583 for the trailing comment get the block start code point: diff old & new Blocks.txt 2584 -> add to UCharacter.UnicodeBlock IDs 2585 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 2586 replace public static final int \1_ID = \2; \3 2587 -> add to UCharacter.UnicodeBlock objects 2588 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 2589 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 2590 2591 sc ; Elym ; Elymaic 2592 sc ; Hmnp ; Nyiakeng_Puachue_Hmong 2593 sc ; Nand ; Nandinagari 2594 sc ; Wcho ; Wancho 2595 -> uscript.h & com.ibm.icu.lang.UScript 2596 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 2597 and in com.ibm.icu.dev.test.lang.TestUScript.java 2598 2599 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 2600 (not strictly necessary for NOT_ENCODED
scripts) 2601 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 2602 2603 * update spoof checker UnicodeSet initializers: 2604 inclusionPat & recommendedPat in uspoof.cpp 2605 INCLUSION & RECOMMENDED in SpoofChecker.java 2606 - make sure that the Unicode Tools tree contains the latest security data files 2607 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 2608 - update the hardcoded version number there in the DIRECTORY path 2609 - run the tool (no special environment variables needed) 2610 - copy & paste from the Console output into the .cpp & .java files 2611 2612 * generate normalization data files 2613 cd $ICU_ROOT/dbg/icu4c 2614 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 2615 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 2616 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 2617 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 2618 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 2619 2620 * build ICU (make install) 2621 so that the tools build can pick up the new definitions from the installed header files. 2622 2623 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 2624 2625 * build Unicode tools using CMake+make 2626 2627 $ICU_SRC/tools/unicode/c/icudefs.txt: 2628 2629 # Location (--prefix) of where ICU was installed. 2630 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 2631 # Location of the ICU4C source tree. 
2632 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) 2633 2634 $ICU_ROOT/dbg$ 2635 mkdir -p tools/unicode/c 2636 cd tools/unicode/c 2637 2638 $ICU_ROOT/dbg/tools/unicode/c$ 2639 cmake ../../../../src/tools/unicode/c 2640 make 2641 2642 * generate core properties data files 2643 $ICU_ROOT/dbg/tools/unicode/c$ 2644 genprops/genprops $ICU_SRC/icu4c 2645 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ 2646 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 2647 - rebuild ICU (make install) & tools 2648 2649 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 2650 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 2651 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 2652 - Unicode 6.0..12.0: U+2260, U+226E, U+226F 2653 - nothing new in this Unicode version, no test file to update 2654 2655 * run & fix ICU4C tests 2656 - update test of default bidi classes: 2657 Bidi range \U0001ED00-\U0001ED4F changes default from R to AL, 2658 see diffs in DerivedBidiClass.txt 2659 + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[] 2660 + UCharacterTest.java TestIteration() defaultBidi[] 2661 - Andy handles RBBI & spoof check test failures 2662 2663 * collation: CLDR collation root, UCA DUCET 2664 2665 - UCA DUCET goes into Mark's Unicode tools, see 2666 https://sites.google.com/site/unicodetools/home#TOC-UCA 2667 diff the main mapping file, look for bad changes 2668 (for example, more bytes per weight for common characters) 2669 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt 2670 ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt 2671 2672 - CLDR root data files are checked into $CLDR_SRC/common/uca/ 2673 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 2674 2675 - update 
source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 2676 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 2677 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 2678 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 2679 (note removing the underscore before "Rules") 2680 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 2681 - restore TODO diffs in UCARules.txt 2682 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 2683 - update (ICU4C)/source/test/testdata/CollationTest_*.txt 2684 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 2685 from the CLDR root files (..._CLDR_..._SHORT.txt) 2686 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 2687 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 2688 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 2689 - if CLDR common/uca/unihan-index.txt changes, then update 2690 CLDR common/collation/root.xml <collation type="private-unihan"> 2691 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 2692 2693 - run genuca, see command line above; 2694 deal with 2695 Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: 2696 FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible) 2697 (add the character to genuca.cpp sampleCharsToScripts[]) 2698 + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script) 2699 and cache its values. 2700 Works as long as the script metadata is updated before the collation data. 
2701 - rebuild ICU4C 2702 2703 * Unihan collators 2704 https://sites.google.com/site/unicodetools/unihan 2705 - run Unicode Tools 2706 org.unicode.draft.GenerateUnihanCollators 2707 with VM arguments 2708 -ea 2709 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 2710 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 2711 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 2712 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 2713 -DUVERSION=12.0.0 2714 - run Unicode Tools 2715 org.unicode.draft.GenerateUnihanCollatorFiles 2716 with the same arguments 2717 - check CLDR diffs 2718 cd $CLDR_SRC 2719 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 2720 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 2721 - copy to CLDR 2722 cd $CLDR_SRC 2723 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 2724 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 2725 - run CLDR unit tests, commit to CLDR 2726 - generate ICU zh collation data: run CLDR 2727 org.unicode.cldr.icu.NewLdml2IcuConverter 2728 with program arguments 2729 -t collation 2730 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation 2731 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 2732 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll 2733 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation 2734 zh 2735 and VM arguments 2736 -ea 2737 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 2738 - rebuild ICU4C 2739 2740 * run & fix ICU4C tests, now with new CLDR collation root data 2741 - run all tests with the collation test data *_SHORT.txt or the full files 2742 (the full ones have comments, useful for debugging) 2743 - note on intltest: if collate/UCAConformanceTest fails, then 2744 utility/MultithreadTest/TestCollators will fail as well; 2745 fix the conformance test before looking into the 
multi-thread test 2746 2747 * update Java data files 2748 - refresh just the UCD/UCA-related/derived files, just to be safe 2749 - see (ICU4C)/source/data/icu4j-readme.txt 2750 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2751 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2752 output: 2753 ... 2754 Unicode .icu files built to ./out/build/icudt63l 2755 echo timestamp > uni-core-data 2756 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b 2757 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b 2758 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 2759 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b 2760 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b" 2761 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/ 2762 mkdir -p /tmp/icu4j/main/shared/data 2763 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 2764 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/ 2765 mkdir -p /tmp/icu4j/main/shared/data 2766 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 2767 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 2768 - copy the big-endian Unicode data files to another location, 2769 separate from the other data files, 2770 and then refresh ICU4J 2771 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 2772 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2773 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2774 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2775 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2776 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 2777 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2778 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2779 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2780 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2781 2782 * When refreshing all of ICU4J data from ICU4C 2783 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2784 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 2785 or 2786 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 2787 2788 * update CollationFCD.java 2789 + copy & paste the initializers of lcccIndex[] etc. from 2790 ICU4C/source/i18n/collationfcd.cpp to 2791 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 2792 2793 * refresh Java test .txt files 2794 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 2795 cd $ICU_SRC/icu4c/source/data/unidata 2796 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2797 cd ../../test/testdata 2798 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2799 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2800 2801 * run & fix ICU4J tests 2802 2803 *** API additions 2804 - send notice to icu-design about new born-@stable API (enum constants etc.) 
2805 2806 *** CLDR numbering systems 2807 - look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR 2808 for example, look for 2809 ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt 2810 in new blocks (Blocks.txt) 2811 Unicode 12: using Unicode 12 CLDR ticket #11478 2812 hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong 2813 wcho 1E2F0..1E2F9 Wancho 2814 Unicode 11: using Unicode 11 CLDR ticket #10978 2815 rohg 10D30..10D39 Hanifi_Rohingya 2816 gong 11DA0..11DA9 Gunjala_Gondi 2817 Earlier: CLDR tickets specific to adding new numbering systems. 2818 Unicode 10: http://unicode.org/cldr/trac/ticket/10219 2819 Unicode 9: http://unicode.org/cldr/trac/ticket/9692 2820 2821 *** merge the Unicode update branches back onto the trunk 2822 - do not merge the icudata.jar and testdata.jar, 2823 instead rebuild them from merged & tested ICU4C 2824 - make sure that changes to Unicode tools are checked in: 2825 http://www.unicode.org/utility/trac/log/trunk/unicodetools 2826 2827 ---------------------------------------------------------------------------- *** 2828 2829 ICU 63 addition of ICU support of text layout properties InPC, InSC, vo 2830 2831 * Command-line environment setup 2832 2833 UNICODE_DATA=~/unidata/uni11/20180609 2834 CLDR_SRC=~/svn.cldr/uni 2835 ICU_ROOT=~/icu/mine 2836 ICU_SRC=$ICU_ROOT/src 2837 ICUDT=icudt62b 2838 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 2839 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 2840 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 2841 2842 *** Links 2843 2844 https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC 2845 https://unicode-org.atlassian.net/browse/ICU-12850 vo 2846 2847 *** data files & enums & parser code 2848 2849 * API additions 2850 - for each of the three new enumerated properties 2851 + uchar.h: add the enum UProperty constant UCHAR_<long prop name> 2852 + uchar.h: update UCHAR_INT_LIMIT 2853 + uchar.h: add the enum U<long prop name> 2854 with constants U_<short prop name>_<long 
value name> 2855 + UProperty.java: add the constant <long prop name> 2856 + UProperty.java: update INT_LIMIT 2857 + UCharacter.java: add the interface <long prop name> 2858 with constants <long value name> 2859 2860 * process and/or copy files 2861 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 2862 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 2863 + It also writes tools/unicode/c/genprops/pnames_data.h with property and value 2864 names and aliases. 2865 + For debugging, and tweaking how ppucd.txt is written, 2866 the tool has an --only_ppucd option: 2867 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 2868 2869 * preparseucd.py changes 2870 - add new property short names (uppercase) to _prop_and_value_re 2871 so that ParseUCharHeader() parses the new enum constants 2872 2873 * build ICU (make install) 2874 so that the tools build can pick up the new definitions from the installed header files. 2875 2876 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 2877 2878 * build Unicode tools using CMake+make 2879 2880 $ICU_SRC/tools/unicode/c/icudefs.txt: 2881 2882 # Location (--prefix) of where ICU was installed. 2883 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 2884 # Location of the ICU4C source tree. 
2885 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c) 2886 2887 $ICU_ROOT/dbg$ 2888 mkdir -p tools/unicode/c 2889 cd tools/unicode/c 2890 2891 $ICU_ROOT/dbg/tools/unicode/c$ 2892 cmake ../../../../src/tools/unicode/c 2893 make 2894 2895 * generate core properties data files 2896 $ICU_ROOT/dbg/tools/unicode/c$ 2897 genprops/genprops $ICU_SRC/icu4c 2898 - rebuild ICU (make install) & tools 2899 2900 * write data for runtime, hardcoded for now 2901 - add genprops/layoutpropsbuilder.cpp with pieces from sibling files 2902 - generate new icu4c/source/common/ulayout_props_data.h 2903 - for each of the three new enumerated properties 2904 + int property max value 2905 + small, 8-bit UCPTrie 2906 (A small 16-bit trie with bit fields for these three properties 2907 is very nearly the same size as the sum of the three.) 2908 2909 * wire into C++ 2910 - uprops.cpp: #include ulayout_props_data.h 2911 - uprops.cpp: add getInPC() etc. functions 2912 - uprops.cpp: add lines to intProps[], include max values 2913 - uprops.h: add UPropertySource constants 2914 - uprops.cpp: add uprops_addPropertyStarts(src) 2915 - uniset_props.cpp: add to UnicodeSet_initInclusion() 2916 - intltest/ucdtest.cpp: write unit tests 2917 2918 * update Java data files 2919 - refresh just the pnames.icu file with the new property [value] names, just to be safe 2920 - see $ICU_SRC/icu4c/source/data/icu4j-readme.txt 2921 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2922 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2923 - copy the big-endian Unicode data files to another location, 2924 separate from the other data files, 2925 and then refresh ICU4J 2926 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 2927 cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2928 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2929 2930 * wire into Java 2931 - UCharacterProperty.java: add new SRC_INPC etc.
constants as in C++ 2932 - UCharacterProperty.java: for each new property 2933 + create a nested class to hold its CodePointTrie 2934 + initialize it from a string literal 2935 + paste in the initializer printed by genprops 2936 + add a new IntProperty object to the intProps[] array 2937 + use the correct max int value for each property, also printed by genprops 2938 - UCharacterProperty.java: add ulayout_addPropertyStarts(src, set) 2939 - UnicodeSet.java: add to getInclusions() 2940 - UCharacterTest.java: write unit tests 2941 2942 ---------------------------------------------------------------------------- *** 2943 2944 Unicode 11.0 update for ICU 62 2945 2946 http://www.unicode.org/versions/Unicode11.0.0/ 2947 http://unicode.org/versions/beta-11.0.0.html 2948 https://www.unicode.org/review/pri372/ 2949 http://www.unicode.org/reports/uax-proposed-updates.html 2950 http://www.unicode.org/reports/tr44/tr44-21.html 2951 2952 * Command-line environment setup 2953 2954 UNICODE_DATA=~/unidata/uni11/20180521 2955 CLDR_SRC=~/svn.cldr/uni 2956 ICU_ROOT=~/svn.icu/uni 2957 ICU_SRC=$ICU_ROOT/src 2958 ICUDT=icudt61b 2959 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 2960 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 2961 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 2962 2963 *** ICU Trac 2964 2965 - ticket:13630: Unicode 11 2966 - ^/branches/markus/uni11 2967 2968 *** CLDR Trac 2969 2970 - cldrbug 10978: Unicode 11 2971 - ^/branches/markus/uni11 2972 2973 *** Unicode version numbers 2974 - makedata.mak 2975 - uchar.h 2976 - com.ibm.icu.util.VersionInfo 2977 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 2978 2979 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 2980 so that the makefiles see the new version number. 
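A quick sanity check before running "configure" could look like the sketch below. It assumes that uchar.h carries the version as a `#define U_UNICODE_VERSION "x.y"` line; the helper name and the sample header text are made up for illustration and are not part of the ICU tools.

```python
import re

# Hypothetical helper: pull the Unicode version string out of uchar.h text.
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)

def unicode_version_in_header(header_text):
    """Return the U_UNICODE_VERSION string found in header_text, or None."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None

# Made-up sample standing in for the real uchar.h contents.
sample_header = '/* sample */\n#define U_UNICODE_VERSION "11.0"\n'
print(unicode_version_in_header(sample_header))
```

In practice one would read $ICU_SRC/icu4c/source/common/unicode/uchar.h and compare the result with the version being integrated before running configure.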
2981 2982 *** data files & enums & parser code 2983 2984 * download files 2985 - mkdir -p $UNICODE_DATA 2986 - download Unicode files into $UNICODE_DATA 2987 + subfolders: emoji, idna, security, ucd, uca 2988 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 2989 2990 * for manual diffs and for Unicode Tools input data updates: 2991 remove version suffixes from the file names 2992 ~$ unidata/desuffixucd.py $UNICODE_DATA 2993 (see https://sites.google.com/site/unicodetools/inputdata) 2994 2995 * process and/or copy files 2996 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 2997 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 2998 + For debugging, and tweaking how ppucd.txt is written, 2999 the tool has an --only_ppucd option: 3000 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 3001 3002 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 3003 3004 * build ICU (make install) 3005 so that the tools build can pick up the new definitions from the installed header files. 
3006 3007 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 3008 3009 * preparseucd.py changes 3010 - fix other errors 3011 NameError: unknown property Extended_Pictographic 3012 -> add Extended_Pictographic binary property 3013 -> add new short names for all Emoji properties 3014 3015 * new constants for new property values 3016 - preparseucd.py error: 3017 ValueError: missing uchar.h enum constants for some property values: 3018 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar', 3019 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals', 3020 u'Indic_Siyaq_Numbers'])), 3021 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])), 3022 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])), 3023 (u'GCB', set([u'LinkC', u'Virama'])), 3024 (u'WB', set([u'WSegSpace']))] 3025 = PropertyValueAliases.txt new property values (diff old & new .txt files) 3026 blk; Chess_Symbols ; Chess_Symbols 3027 blk; Dogra ; Dogra 3028 blk; Georgian_Ext ; Georgian_Extended 3029 blk; Gunjala_Gondi ; Gunjala_Gondi 3030 blk; Hanifi_Rohingya ; Hanifi_Rohingya 3031 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers 3032 blk; Makasar ; Makasar 3033 blk; Mayan_Numerals ; Mayan_Numerals 3034 blk; Medefaidrin ; Medefaidrin 3035 blk; Old_Sogdian ; Old_Sogdian 3036 blk; Sogdian ; Sogdian 3037 -> add to uchar.h 3038 use long property names for enum constants, 3039 for the trailing comment get the block start code point: diff old & new Blocks.txt 3040 -> add to UCharacter.UnicodeBlock IDs 3041 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 3042 replace public static final int \1_ID = \2; \3 3043 -> add to UCharacter.UnicodeBlock objects 3044 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 3045 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 3046 3047 GCB; LinkC ; LinkingConsonant 3048 GCB; Virama ; Virama 3049 -> uchar.h & 
UCharacter.GraphemeClusterBreak 3050 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76 3051 3052 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed 3053 -> ignore: ICU does not yet support this property 3054 3055 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya 3056 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa 3057 -> uchar.h & UCharacter.JoiningGroup 3058 3059 sc ; Dogr ; Dogra 3060 sc ; Gong ; Gunjala_Gondi 3061 sc ; Maka ; Makasar 3062 sc ; Medf ; Medefaidrin 3063 sc ; Rohg ; Hanifi_Rohingya 3064 sc ; Sogd ; Sogdian 3065 sc ; Sogo ; Old_Sogdian 3066 -> uscript.h & com.ibm.icu.lang.UScript 3067 -> Nushu had been added already 3068 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 3069 and in com.ibm.icu.dev.test.lang.TestUScript.java 3070 3071 WB ; WSegSpace ; WSegSpace 3072 -> uchar.h & UCharacter.WordBreak 3073 3074 * New short names for emoji properties 3075 - see UTS #51 3076 - short names set in preparseucd.py 3077 3078 * New properties 3079 - boolean emoji property Extended_Pictographic 3080 -> added in preparseucd.py 3081 -> uchar.h & UProperty.java 3082 - misc. 
property Equivalent_Unified_Ideograph (EqUIdeo) 3083 as shown in PropertyValueAliases.txt 3084 -> ignore for now 3085 3086 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 3087 (not strictly necessary for NOT_ENCODED scripts) 3088 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 3089 3090 * update spoof checker UnicodeSet initializers: 3091 inclusionPat & recommendedPat in uspoof.cpp 3092 INCLUSION & RECOMMENDED in SpoofChecker.java 3093 - make sure that the Unicode Tools tree contains the latest security data files 3094 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 3095 - update the hardcoded version number there in the DIRECTORY path 3096 - run the tool (no special environment variables needed) 3097 - copy & paste from the Console output into the .cpp & .java files 3098 3099 * generate normalization data files 3100 cd $ICU_ROOT/dbg/icu4c 3101 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 3102 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 3103 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 3104 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 3105 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 3106 3107 * build ICU (make install) 3108 so that the tools build can pick up the new definitions from the installed header files. 3109 3110 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 3111 3112 * build Unicode tools using CMake+make 3113 3114 $ICU_SRC/tools/unicode/c/icudefs.txt: 3115 3116 # Location (--prefix) of where ICU was installed. 3117 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 3118 # Location of the ICU4C source tree. 
3119 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c) 3120 3121 $ICU_ROOT/dbg$ 3122 mkdir -p tools/unicode/c 3123 cd tools/unicode/c 3124 3125 $ICU_ROOT/dbg/tools/unicode/c$ 3126 cmake ../../../../src/tools/unicode/c 3127 make 3128 3129 * generate core properties data files 3130 $ICU_ROOT/dbg/tools/unicode/c$ 3131 genprops/genprops $ICU_SRC/icu4c 3132 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c 3133 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 3134 - rebuild ICU (make install) & tools 3135 3136 * Fix case props 3137 genprops error: casepropsbuilder: too many exceptions words 3138 genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR 3139 - With the addition of Georgian Mtavruli capital letters, 3140 there are now too many simple case mappings with big mapping deltas 3141 that yield uncompressible exceptions. 3142 - Changing the data structure (now formatVersion 4), 3143 adding one bit for no-simple-case-folding (for Cherokee), and 3144 one optional slot for a big delta (for most faraway mappings), 3145 together with another bit for whether that is negative. 3146 This makes most Cherokee & Georgian etc. case mappings compressible, 3147 reducing the number of exceptions words. 3148 - Further changes to gain one more bit for the exceptions index, 3149 for future growth. Details see casepropsbuilder.cpp. 
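The idea behind the optional big-delta slot can be sketched roughly as follows: a simple case mapping that is too far away for a small inline delta is stored as one magnitude plus a sign bit, rather than as full target code points. The inline range, the dict fields, and the function names are assumptions for this illustration only; casepropsbuilder.cpp has the real layout.

```python
# Illustration only: the inline range and record fields are assumptions,
# not the actual ucase.icu formatVersion 4 layout.
SMALL_DELTA_MIN, SMALL_DELTA_MAX = -256, 255

def encode_simple_mapping(code_point, mapped):
    """Describe how one simple case mapping could be stored."""
    delta = mapped - code_point
    if SMALL_DELTA_MIN <= delta <= SMALL_DELTA_MAX:
        # Near mapping: the delta fits directly, no exception needed.
        return {"kind": "inline", "delta": delta}
    # Faraway mapping: one slot for the magnitude, one bit for the sign.
    return {"kind": "exception", "big_delta": abs(delta), "negative": delta < 0}

def decode_simple_mapping(code_point, entry):
    """Recover the mapped code point from an encoded entry."""
    if entry["kind"] == "inline":
        return code_point + entry["delta"]
    delta = entry["big_delta"]
    return code_point - delta if entry["negative"] else code_point + delta

# Georgian an U+10D0 -> Mtavruli an U+1C90, Cherokee U+AB70 -> U+13A0, Latin a -> A.
for cp, mapped in [(0x10D0, 0x1C90), (0xAB70, 0x13A0), (0x61, 0x41)]:
    entry = encode_simple_mapping(cp, mapped)
    assert decode_simple_mapping(cp, entry) == mapped
```

The point of the sketch is only that a single delta slot plus a sign bit is enough to reach both the Georgian and the Cherokee counterparts, which is what keeps those exceptions small.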

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..11.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

- Errors in char.txt, word.txt, word_POSIX.txt like
    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
     not empty, just to get ICU building.
  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
     and properties together with the rules that used them (GB 10, WB 14).
  -> Andy adjusts the rule sets further to sync with
     Unicode 11 grapheme, word, and line break spec changes.
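
The grep step for uts46test.cpp above can be sketched in Python as follows. This is a
hedged illustration rather than part of the ICU tooling: the function name
non_ascii_std3_valid is hypothetical, and it assumes the semicolon-separated layout of
IdnaMappingTable.txt (code point or range; status; optional mapping; # comment).

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) code point ranges above U+007F whose status is
    disallowed_STD3_valid, from lines in the IdnaMappingTable.txt layout."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first_hex, _, last_hex = fields[0].partition("..")
        first = int(first_hex, 16)
        last = int(last_hex, 16) if last_hex else first
        if last > 0x7F:
            # Clip a range that starts in ASCII so only the non-ASCII part is reported.
            found.append((max(first, 0x80), last))
    return found
```

Passing an open file object for IdnaMappingTable.txt works because the function only
iterates over lines; any range that is not already listed in the log entry above would
be a candidate for new test cases.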

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
    https://sites.google.com/site/unicodetools/home#TOC-UCA
    diff the main mapping file, look for bad changes
    (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
    ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
    FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=11.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt61l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
  or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
    Unicode 11: using Unicode 11 CLDR ticket #10978
        rohg 10D30..10D39 Hanifi_Rohingya
        gong 11DA0..11DA9 Gunjala_Gondi
    Earlier: CLDR tickets specific to adding new numbering systems.
        Unicode 10: http://unicode.org/cldr/trac/ticket/10219
        Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
    http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 10.0 update for ICU 60

http://www.unicode.org/versions/Unicode10.0.0/
http://www.unicode.org/versions/beta-10.0.0.html
http://blog.unicode.org/2017/03/unicode-100-beta-review.html
http://www.unicode.org/review/pri350/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-19.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni10/20170605
CLDR_SRC=~/svn.cldr/uni10
ICU_ROOT=~/svn.icu/uni10
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt60b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:12985: Unicode 10
- ticket:13061: undo hacks from emoji 5.0 update
- ticket:13062: add Emoji_Component property
- ^/branches/markus/uni10

*** CLDR Trac

- cldrbug 10055: Unicode 10
- cldrbug 9882: Unicode 10 script metadata
- cldrbug 10219: numbering systems for Unicode 10

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
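
Before running "configure", it can help to confirm that uchar.h really carries the new
version. The following Python sketch is an illustration under assumptions, not part of
the ICU build: the helper name unicode_version_in_header is hypothetical, and it assumes
the version is defined in the form  #define U_UNICODE_VERSION "10.0"  in uchar.h.

```python
import re

# Matches a line such as:  #define U_UNICODE_VERSION "10.0"
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def unicode_version_in_header(header_text):
    """Return the Unicode version string found in the header text, or None."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None
```

Reading $ICU_SRC/icu4c/source/common/unicode/uchar.h into a string and comparing the
result with the version being integrated gives a quick sanity check before the
makefiles are regenerated.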

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode 10.0 files into $UNICODE_DATA
  + subfolders: ucd, uca, idna, security
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- download emoji 5.0 files into $UNICODE_DATA/emoji

* for manual diffs: remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
    (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
    so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
  -> adjust _scripts_only_in_iso15924 as indicated
- fix other errors
    Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
  -> add vo=Vertical_Orientation to _ignored_properties
  -> later removed again, parsing the file, even though we do not yet store data for runtime use

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
                   u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
     (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
                  u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
                  u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
     (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; CJK_Ext_F        ; CJK_Unified_Ideographs_Extension_F
    blk; Kana_Ext_A       ; Kana_Extended_A
    blk; Masaram_Gondi    ; Masaram_Gondi
    blk; Nushu            ; Nushu
    blk; Soyombo          ; Soyombo
    blk; Syriac_Sup       ; Syriac_Supplement
    blk; Zanabazar_Square ; Zanabazar_Square
  -> add to uchar.h
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    jg ; Malayalam_Bha  ; Malayalam_Bha
    jg ; Malayalam_Ja   ; Malayalam_Ja
    jg ; Malayalam_Lla  ; Malayalam_Lla
    jg ; Malayalam_Llla ; Malayalam_Llla
    jg ; Malayalam_Nga  ; Malayalam_Nga
    jg ; Malayalam_Nna  ; Malayalam_Nna
    jg ; Malayalam_Nnna ; Malayalam_Nnna
    jg ; Malayalam_Nya  ; Malayalam_Nya
    jg ; Malayalam_Ra   ; Malayalam_Ra
    jg ; Malayalam_Ssa  ; Malayalam_Ssa
    jg ; Malayalam_Tta  ; Malayalam_Tta
  -> uchar.h & UCharacter.JoiningGroup

    sc ; Gonm ; Masaram_Gondi
    sc ; Nshu ; Nushu
    sc ; Soyo ; Soyombo
    sc ; Zanb ; Zanabazar_Square
  -> uscript.h & com.ibm.icu.lang.UScript
  -> Nushu had been added already
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* New properties as shown in PropertyValueAliases.txt changes
- boolean Emoji_Component from emoji 5
  -> uchar.h & UProperty.java
- boolean
    # Regional_Indicator (RI)

    RI ; N ; No ; F ; False
    RI ; Y ; Yes ; T ; True
  -> uchar.h & UProperty.java
  -> single immutable range, to be hardcoded
- boolean
    # Prepended_Concatenation_Mark (PCM)

    PCM; N ; No ; F ; False
    PCM; Y ; Yes ; T ; True
  -> was new in Unicode 9
  -> uchar.h & UProperty.java
- enumerated
    # Vertical_Orientation (vo)

    vo ; R  ; Rotated
    vo ; Tr ; Transformed_Rotated
    vo ; Tu ; Transformed_Upright
    vo ; U  ; Upright
  -> only pre-parsed for now, but not yet stored for runtime use

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
    so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

    $ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..10.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
    https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
    FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
    -DUVERSION=10.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt60l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
  or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
    http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Emoji 5.0 update for ICU 59
- ICU 59 mostly remains on Unicode 9.0
- except updates bidi and segmentation data to Unicode 10 beta

First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
ICUDT=icudt59b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
UNIDATA=$ICU4C_SRC_DIR/source/data/unidata

*** ICU Trac

- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
- changes directly on trunk

*** data files & enums & parser code

* download files

- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
- download emoji 5.0 beta files into the same uni90e50 folder
- download Unicode 10.0 beta files: ucd
  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
    BidiBrackets.txt
    BidiCharacterTest.txt
    BidiMirroring.txt
    BidiTest.txt
    extracted/DerivedBidiClass.txt
  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
    LineBreak.txt
    auxiliary/*

* preparseucd.py changes
- adjust for combined trunks
- write new copyright lines
- ignore new Emoji_Component property for now

* process and/or copy files
- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA

* build ICU (make install)
    so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

    ~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)

    ~/svn.icu/trunk/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    ~/svn.icu/trunk/dbg/tools/unicode/c$
    genprops/genprops $ICU4C_SRC_DIR
- rebuild ICU (make install) & tools

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt59l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
or
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU4C_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

---------------------------------------------------------------------------- ***

Unicode 9.0 update
for ICU 58

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
  and copy to $UNIDATA
    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- DerivedNumericValues.txt new numeric values
    0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
    0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
    0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
    0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
    0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
     uchar.c, UCharacterProperty.java
     to support a new series of values
- adjust preparseucd.py for Tangut algorithmic names
  in ppucd.txt:
    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
  ->
    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
- avoid block-compressing most String/Miscellaneous property values,
  triggered by genprops not coping with a multi-code point
  Case_Folding on
    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors

* PropertyAliases.txt changes
- 1 new property PCM=Prepended_Concatenation_Mark
  Ignore: Only useful for layout engines.
  Ok to list in ppucd.txt.

* PropertyValueAliases.txt new property values
  blk; Adlam ; Adlam
  blk; Bhaiksuki ; Bhaiksuki
  blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
  blk; Glagolitic_Sup ; Glagolitic_Supplement
  blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
  blk; Marchen ; Marchen
  blk; Mongolian_Sup ; Mongolian_Supplement
  blk; Newa ; Newa
  blk; Osage ; Osage
  blk; Tangut ; Tangut
  blk; Tangut_Components ; Tangut_Components
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

  GCB; EB ; E_Base
  GCB; EBG ; E_Base_GAZ
  GCB; EM ; E_Modifier
  GCB; GAZ ; Glue_After_Zwj
  GCB; ZWJ ; ZWJ
  -> uchar.h & UCharacter.GraphemeClusterBreak

  jg ; African_Feh ; African_Feh
  jg ; African_Noon ; African_Noon
  jg ; African_Qaf ; African_Qaf
  -> uchar.h & UCharacter.JoiningGroup

  lb ; EB ; E_Base
  lb ; EM ; E_Modifier
  lb ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.LineBreak

  sc ; Adlm ; Adlam
  sc ; Bhks ; Bhaiksuki
  sc ; Marc ; Marchen
  sc ; Newa ; Newa
  sc ; Osge ; Osage
  sc ; Tang ; Tangut
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

  WB ; EB ; E_Base
  WB ; EBG ; E_Base_GAZ
  WB ; EM ; E_Modifier
  WB ; GAZ ; Glue_After_Zwj
  WB
      ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.WordBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$
        cmake ../../../src/unicode/c
        make

* generate core properties data files
    ~/svn.icutools/trunk/dbg/unicode/c$
        genprops/genprops $ICU_SRC_DIR
        genuca/genuca --hanOrder implicit $ICU_SRC_DIR
        genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
- nothing new in 9.0, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files
  (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
    FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
    -DUVERSION=9.0.0
    -ea
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd ~/svn.cldr/trunk
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml \
        ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd ~/svn.cldr/trunk
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /home/mscherer/svn.cldr/trunk/common/collation
    -m /home/mscherer/svn.cldr/trunk/common/supplemental
    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
    zh
  and VM arguments
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt58l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
-> com.ibm.icu.lang.UScript
    find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
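
The find/replace pair for com.ibm.icu.lang.UScript in this section is meant for an editor's regex search. Purely as an illustration (this is not part of the ICU tooling), the same substitution could be sketched in Python. The helper name to_java_constant and the sample enum line are assumptions made up for this example.

```python
import re

# Same regular expression and replacement text as the find/replace pair in this section.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2; \3"


def to_java_constant(c_enum_line):
    """Rewrite one uscript.h-style enum line as a UScript.java-style constant.

    Hypothetical helper for illustration; a line that does not match
    the pattern is returned unchanged.
    """
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, c_enum_line)


# Made-up input line (assumed shape of a uscript.h enum entry, not a real script code).
sample = "USCRIPT_EXAMPLE = 999,/* Xmpl */"
converted = to_java_constant(sample)
print(converted)
```

Because the pattern only rewrites the matched part of a line, leading indentation and any preceding doc comment on the same line would be carried over as-is.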

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt57b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
see https://unicode-org.atlassian.net/browse/ICU-12141

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

---------------------------------------------------------------------------- ***

Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802

Edit preparseucd.py to add & parse new properties.
They share the UCD property namespace but are not listed in PropertyAliases.txt.

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.
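
As a rough sketch of what parsing the new input file involves, the snippet below reads lines in the usual UCD layout (a code point or range, a semicolon, a property name, and an optional # comment). It assumes that emoji-data.txt follows that layout, and it is not the actual preparseucd.py code. The function name parse_ucd_line and the sample lines are made up for illustration.

```python
def parse_ucd_line(line):
    """Parse 'start[..end] ; Property # comment' into (start, end, property).

    Hypothetical helper; returns None for blank, comment-only, or malformed lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment, if any
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2:
        return None
    code_points, prop = fields[0], fields[1]
    if ".." in code_points:
        start, end = code_points.split("..")
    else:
        start = end = code_points
    return int(start, 16), int(end, 16), prop


# Made-up sample in the assumed layout, not copied from a real data file.
sample = """\
# illustrative lines only
1F600..1F64F ; Emoji # example range
1F3FB..1F3FF ; Emoji_Modifier
0023 ; Emoji
"""

ranges = [r for r in map(parse_ucd_line, sample.splitlines()) if r is not None]
for start, end, prop in ranges:
    print(f"{start:04X}..{end:04X} {prop}")
```

Keeping single code points as one-element ranges means the caller only has to handle one shape of result.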

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/8.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- property and file name change:
    IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (unreduced fractions)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to map them to reduced
     fractions (e.g., 1/6)
     which are listed in DerivedNumericValues.txt;
     keeps storage in data file simple

* PropertyValueAliases.txt changes
- 10 new Block (blk) values:
    blk; Ahom ; Ahom
    blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
    blk; Cherokee_Sup ; Cherokee_Supplement
    blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
    blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
    blk; Hatran ; Hatran
    blk; Multani ; Multani
    blk; Old_Hungarian ; Old_Hungarian
    blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
    blk; Sutton_SignWriting ; Sutton_SignWriting
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 6 new Script (sc) values:
    sc ; Ahom ; Ahom
    sc ; Hatr ; Hatran
    sc ; Hluw ; Anatolian_Hieroglyphs
    sc ; Hung ; Old_Hungarian
    sc ; Mult ; Multani
    sc ; Sgnw ; SignWriting
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o \
        $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
    ~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..8.0: U+2260, U+226E, U+226F
- nothing new in 8.0, no test file to update

* run & fix ICU4C tests
- bad Cherokee case folding due to difference in fallbacks:
  UCD case folding falls back to no mapping,
  ICU runtime case folding falls back to lowercasing;
  fixed casepropsbuilder.cpp to generate scf mappings to self
  when there is an slc mapping but no scf
- Andy
  handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the script for the new sample characters
4452 (e.g., in FractionalUCA.txt) 4453 + *add* mappings to sampleCharsToScripts[], do not replace them 4454 (in case the script sample characters flip-flop) 4455 + insert new scripts in DUCET script order, see the top_byte table 4456 at the beginning of FractionalUCA.txt 4457 - rebuild ICU4C 4458 4459 * run & fix ICU4C tests, now with new CLDR collation root data 4460 - run all tests with the collation test data *_SHORT.txt or the full files 4461 (the full ones have comments, useful for debugging) 4462 - note on intltest: if collate/UCAConformanceTest fails, then 4463 utility/MultithreadTest/TestCollators will fail as well; 4464 fix the conformance test before looking into the multi-thread test 4465 - fixed bug in CollationWeights::getWeightRanges() 4466 exposed by new data and CollationTest::TestRootElements 4467 4468 * update Java data files 4469 - refresh just the UCD/UCA-related/derived files, just to be safe 4470 - see (ICU4C)/source/data/icu4j-readme.txt 4471 - mkdir /tmp/icu4j 4472 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4473 output: 4474 ... 
4475 Unicode .icu files built to ./out/build/icudt56l 4476 echo timestamp > uni-core-data 4477 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b 4478 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b 4479 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 4480 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b 4481 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b" 4482 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/ 4483 mkdir -p /tmp/icu4j/main/shared/data 4484 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 4485 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/ 4486 mkdir -p /tmp/icu4j/main/shared/data 4487 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 4488 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' 4489 - copy the big-endian Unicode data files to another location, 4490 separate from the other data files, 4491 and then refresh ICU4J 4492 cd ~/svn.icu/trunk/dbg/data/out/icu4j 4493 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 4494 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 4495 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4496 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4497 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 4498 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4499 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 
4500 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 4501 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 4502 4503 * When refreshing all of ICU4J data from ICU4C 4504 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4505 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 4506 or 4507 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 4508 4509 * update CollationFCD.java 4510 + copy & paste the initializers of lcccIndex[] etc. from 4511 ICU4C/source/i18n/collationfcd.cpp to 4512 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 4513 4514 * refresh Java test .txt files 4515 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 4516 cd $ICU_SRC_DIR/source/data/unidata 4517 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 4518 cd ../../test/testdata 4519 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 4520 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 4521 4522 * run & fix ICU4J tests 4523 4524 *** LayoutEngine script information 4525 4526 * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more, 4527 because the layout engine was deprecated in ICU 54. 4528 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java 4529 to write lines that we used to add manually. 4530 4531 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 4532 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 4533 in the working directory. 

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

Unicode 7.0 update for ICU 54

http://www.unicode.org/review/pri271/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-13.html

*** ICU Trac

- ticket 10821: Unicode 7.0, UCA 7.0
- C++ branches/markus/uni70 at r35584 from trunk at r35580
- Java branches/markus/uni70 at r35587 from trunk at r35545

*** CLDR Trac

- ticket 7195: UCA 7.0 CLDR root collation
- branches/markus/uni70 at r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
    'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
    'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the first file whose copyright line contains the current year

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah ; Bassa_Vah
    blk; Caucasian_Albanian ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
    blk; Duployan ; Duployan
    blk; Elbasan ; Elbasan
    blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
    blk; Grantha ; Grantha
    blk; Khojki ; Khojki
    blk; Khudawadi ; Khudawadi
    blk; Latin_Ext_E ; Latin_Extended_E
    blk; Linear_A ; Linear_A
    blk; Mahajani ; Mahajani
    blk; Manichaean ; Manichaean
    blk; Mende_Kikakui ; Mende_Kikakui
    blk; Modi ; Modi
    blk; Mro ; Mro
    blk; Myanmar_Ext_B ; Myanmar_Extended_B
    blk; Nabataean ; Nabataean
    blk; Old_North_Arabian ; Old_North_Arabian
    blk; Old_Permic ; Old_Permic
    blk; Ornamental_Dingbats ; Ornamental_Dingbats
    blk; Pahawh_Hmong ; Pahawh_Hmong
    blk; Palmyrene ; Palmyrene
    blk; Pau_Cin_Hau ; Pau_Cin_Hau
    blk; Psalter_Pahlavi ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
    blk; Siddham ; Siddham
    blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C ; Supplemental_Arrows_C
    blk; Tirhuta ; Tirhuta
    blk; Warang_Citi ; Warang_Citi
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph ; Manichaean_Aleph
    jg ; Manichaean_Ayin ; Manichaean_Ayin
    jg ; Manichaean_Beth ; Manichaean_Beth
    jg ; Manichaean_Daleth ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
    jg ; Manichaean_Five ; Manichaean_Five
    jg ; Manichaean_Gimel ; Manichaean_Gimel
    jg ; Manichaean_Heth ; Manichaean_Heth
    jg ; Manichaean_Hundred ; Manichaean_Hundred
    jg ; Manichaean_Kaph ; Manichaean_Kaph
    jg ; Manichaean_Lamedh ; Manichaean_Lamedh
    jg ; Manichaean_Mem ; Manichaean_Mem
    jg ; Manichaean_Nun ; Manichaean_Nun
    jg ; Manichaean_One ; Manichaean_One
    jg ; Manichaean_Pe ; Manichaean_Pe
    jg ; Manichaean_Qoph ; Manichaean_Qoph
    jg ; Manichaean_Resh ; Manichaean_Resh
    jg ; Manichaean_Sadhe ; Manichaean_Sadhe
    jg ; Manichaean_Samekh ; Manichaean_Samekh
    jg ; Manichaean_Taw ; Manichaean_Taw
    jg ; Manichaean_Ten ; Manichaean_Ten
    jg ; Manichaean_Teth ; Manichaean_Teth
    jg ; Manichaean_Thamedh ; Manichaean_Thamedh
    jg ; Manichaean_Twenty ; Manichaean_Twenty
    jg ; Manichaean_Waw ; Manichaean_Waw
    jg ; Manichaean_Yodh ; Manichaean_Yodh
    jg ; Manichaean_Zayin ; Manichaean_Zayin
    jg ; Straight_Waw ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb ; Caucasian_Albanian
    sc ; Bass ; Bassa_Vah
    sc ; Dupl ; Duployan
    sc ; Elba ; Elbasan
    sc ; Gran ; Grantha
    sc ; Hmng ; Pahawh_Hmong
    sc ; Khoj ; Khojki
    sc ; Lina ; Linear_A
    sc ; Mahj ; Mahajani
    sc ; Mani ; Manichaean
    sc ; Mend ; Mende_Kikakui
    sc ; Modi ; Modi
    sc ; Mroo ; Mro
    sc ; Narb ; Old_North_Arabian
    sc ; Nbat ; Nabataean
    sc ; Palm ; Palmyrene
    sc ; Pauc ; Pau_Cin_Hau
    sc ; Perm ; Old_Permic
    sc ; Phlp ; Psalter_Pahlavi
    sc ; Sidd ; Siddham
    sc ; Sind ; Khudawadi
    sc ; Tirh ; Tirhuta
    sc ; Wara ; Warang_Citi
  -> uscript.h (many were added before)
     comment "Mende Kikakui" for USCRIPT_MENDE
     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2012-11-01)
    Ahom 338 Ahom
    Hatr 127 Hatran
    Mult 323 Multani
    (added 2013-10-12)
    Modi 324 Modi
    Pauc 263 Pau Cin Hau
    Sidd 302 Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
      ...
      Unicode .icu files built to ./out/build/icudt53l
      echo timestamp > uni-core-data
      mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
      mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
      echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
      LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
      mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
      jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
      mkdir -p /tmp/icu4j/main/shared/data
      cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
      jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
      mkdir -p /tmp/icu4j/main/shared/data
      cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
      make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
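
The Eclipse find/replace patterns noted earlier in this section, for turning
new uchar.h UBLOCK_ constants into UCharacter.UnicodeBlock declarations, can
also be applied outside the IDE. What follows is only a minimal Python sketch
of that conversion, not a tool that exists in the ICU tree. The function names
and the sample enum line (including the value 221 and its comment) are made up
for illustration; only the two regular expressions are taken from this log.

```python
import re

# Find patterns copied from the Eclipse notes in this log.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_block_id(c_line):
    """Rewrite one uchar.h enum line as a UnicodeBlock ..._ID int constant."""
    return ID_FIND.sub(
        r"public static final int \g<1>_ID = \g<2>; \g<3>", c_line)


def to_block_object(c_line):
    """Rewrite the same enum line as a UnicodeBlock object declaration."""
    return OBJ_FIND.sub(
        r'public static final UnicodeBlock \g<1> = '
        r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>',
        c_line)


if __name__ == "__main__":
    # Hypothetical input line; the numeric value and comment are illustrative.
    sample = "UBLOCK_BASSA_VAH = 221, /*[16AD0]*/"
    print(to_block_id(sample))
    print(to_block_object(sample))
```

Lines that do not match either pattern are returned unchanged, so both
functions can be mapped over a pasted block of new constants; the result
should still be reviewed by hand before it goes into UCharacter.java.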

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
4967 - Check test file diffs for previously commented-out, known-failing data lines; 4968 probably need to keep those commented out. 4969 4970 * PropertyAliases.txt changes 4971 - 1 new Enumerated Property 4972 bpt ; Bidi_Paired_Bracket_Type 4973 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType 4974 -> ubidi_props.h & .c & UBiDiProps.java 4975 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX 4976 -> uprops.cpp 4977 -> change ubidi.icu format version from 2.0 to 2.1 4978 - 1 new Miscellaneous Property 4979 bpb ; Bidi_Paired_Bracket 4980 -> uchar.h & UProperty.java 4981 -> ppucd.h & .cpp 4982 4983 * PropertyValueAliases.txt changes 4984 - 3 Bidi_Paired_Bracket_Type (bpt) values: 4985 bpt; c ; Close 4986 bpt; n ; None 4987 bpt; o ; Open 4988 -> uchar.h & UCharacter.BidiPairedBracketType 4989 -> ubidi_props.h & .c & UBiDiProps.java 4990 -> change ubidi.icu format version from 2.0 to 2.1 4991 - 4 new Bidi_Class (bc) values: 4992 bc ; FSI ; First_Strong_Isolate 4993 bc ; LRI ; Left_To_Right_Isolate 4994 bc ; RLI ; Right_To_Left_Isolate 4995 bc ; PDI ; Pop_Directional_Isolate 4996 -> uchar.h & UCharacterEnums.ECharacterDirection 4997 -> until the bidi code gets updated, 4998 Roozbeh suggests mapping the new bc values to ON (Other_Neutral) 4999 - 3 new Word_Break (WB) values: 5000 WB ; HL ; Hebrew_Letter 5001 WB ; SQ ; Single_Quote 5002 WB ; DQ ; Double_Quote 5003 -> uchar.h & UCharacter.WordBreak 5004 -> first time Word_Break numeric constants exceed 4 bits (now 17 values) 5005 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html 5006 (added 2012-10-16) 5007 Aghb 239 Caucasian Albanian 5008 Mahj 314 Mahajani 5009 -> uscript.h 5010 -> com.ibm.icu.lang.UScript 5011 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 5012 replace public static final int \1 = \2;\3 5013 -> preparseucd.py _scripts_only_in_iso15924 5014 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() 5015 and in 
com.ibm.icu.dev.test.lang.TestUScript.java 5016 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 5017 (not strictly necessary for NOT_ENCODED scripts) 5018 5019 * generate normalization data files 5020 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib 5021 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in 5022 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata 5023 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt 5024 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt 5025 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 5026 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt 5027 5028 * build ICU (make install) 5029 so that the tools build can pick up the new definitions from the installed header files. 5030 5031 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt 5032 5033 * build Unicode tools using CMake+make 5034 5035 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt: 5036 5037 # Location (--prefix) of where ICU was installed. 5038 set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst) 5039 # Location of the ICU source tree. 
5040 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src) 5041 5042 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c 5043 ~/svn.icutools/trunk/dbg/unicode/c$ make 5044 5045 * generate core properties data files 5046 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src 5047 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src 5048 - rebuild ICU (make install) & tools 5049 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm 5050 - rebuild ICU (make install) & tools 5051 5052 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 5053 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 5054 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 5055 - Unicode 6.0..6.3: U+2260, U+226E, U+226F 5056 - nothing new in 6.3, no test file to update 5057 5058 * update Java data files 5059 - refresh just the UCD-related files, just to be safe 5060 - see (ICU4C)/source/data/icu4j-readme.txt 5061 - mkdir /tmp/icu4j 5062 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5063 output: 5064 ... 
5065 Unicode .icu files built to ./out/build/icudt52l 5066 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b 5067 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b 5068 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 5069 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b 5070 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b" 5071 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/ 5072 mkdir -p /tmp/icu4j/main/shared/data 5073 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 5074 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/ 5075 mkdir -p /tmp/icu4j/main/shared/data 5076 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 5077 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data' 5078 - copy the big-endian Unicode data files to another location, 5079 separate from the other data files 5080 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 5081 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr 5082 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b 5083 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu 5084 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b 5085 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 5086 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp 
com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr 5087 - refresh ICU4J 5088 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b 5089 5090 * refresh Java test .txt files 5091 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 5092 5093 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files 5094 5095 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ 5096 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that 5097 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 5098 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 5099 (note removing the underscore before "Rules") 5100 - update (ICU4C)/source/test/testdata/CollationTest_*.txt 5101 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 5102 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 5103 - check test file diffs for previously commented-out, known-failing data lines; 5104 probably need to keep those commented out 5105 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 5106 - run genuca, see command line above 5107 - rebuild ICU4C 5108 - refresh ICU4J collation data: 5109 (subset of instructions above for properties data refresh, except copies all coll/*) 5110 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5111 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 5112 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 5113 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b 5114 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for 
debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
  - C++ branches/markus/uni62 at r32050 from trunk at r32041
  - Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
    lb ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
    WB ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
    GCB; RI ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
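Note on the gennorm2 command lines above: the four invocations differ only in the output file name and in the list of source files, where nfkc builds on nfc, nfkc_cf builds on nfkc, and uts46 builds on nfc. The following is a minimal Python sketch that rebuilds those argument lists, assuming the same -o/-s options as shown in the log. The names NORM2_SOURCES and gennorm2_commands() are made up for illustration and are not part of ICU; the sketch only builds command lines and does not run the tool.

```python
# Hypothetical helper: rebuilds the gennorm2 command lines listed in the log.
# NORM2_SOURCES and gennorm2_commands() are illustrative names, not ICU code.
NORM2_SOURCES = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, tool="bin/gennorm2"):
    """Return one argv list per .nrm file, in the order used in the log."""
    commands = []
    for output, sources in NORM2_SOURCES.items():
        commands.append(
            [tool, "-o", f"{src_data_in}/{output}", "-s", f"{unidata}/norm2", *sources]
        )
    return commands


if __name__ == "__main__":
    # Print the command lines with the shell variables left unexpanded.
    for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(argv))
```

Keeping the source lists in one table makes it easy to see which .txt files feed each .nrm file when a new normalization form is added.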
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
5238 Unicode .icu files built to ./out/build/icudt50l 5239 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b 5240 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b 5241 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 5242 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b 5243 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b" 5244 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/ 5245 mkdir -p /tmp/icu4j/main/shared/data 5246 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 5247 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/ 5248 mkdir -p /tmp/icu4j/main/shared/data 5249 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 5250 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data' 5251 - copy the big-endian Unicode data files to another location, 5252 separate from the other data files 5253 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 5254 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr 5255 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b 5256 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu 5257 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b 5258 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 5259 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp 
com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr 5260 - refresh ICU4J 5261 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b 5262 5263 * refresh Java test .txt files 5264 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 5265 5266 * UCA 5267 5268 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ 5269 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that 5270 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 5271 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 5272 (note removing the underscore before "Rules") 5273 - update (ICU4C)/source/test/testdata/CollationTest_*.txt 5274 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 5275 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 5276 - check test file diffs for previously commented-out, known-failing data lines; 5277 probably need to keep those commented out 5278 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 5279 - run genuca, see command line above 5280 - rebuild ICU4C 5281 - refresh ICU4J collation data: 5282 (subset of instructions above for properties data refresh, except copies all coll/*) 5283 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5284 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 5285 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 5286 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b 5287 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 5288 - note on intltest: if collate/UCAConformanceTest fails, 
then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- https://icu.unicode.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
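Note on the ISO-only script codes above: the rule amounts to a set difference between the ISO 15924 codes and the script codes that PropertyValueAliases.txt already defines. The following is a minimal Python sketch of that check, assuming "sc ; Code ; Long_Name" lines like the ones quoted elsewhere in this log. The function names are hypothetical, and the actual layout of the _scripts_only_in_iso15924 variable in preparseucd.py is not reproduced here.

```python
def unicode_script_codes(pva_lines):
    """Collect short script codes from 'sc ; Code ; Long_Name' lines."""
    codes = set()
    for line in pva_lines:
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def iso_only_codes(iso_codes, pva_lines):
    """Return the ISO 15924 codes that Unicode does not define (sorted)."""
    return sorted(set(iso_codes) - unicode_script_codes(pva_lines))


if __name__ == "__main__":
    # Sample data taken from the Unicode 6.1 notes in this log.
    pva = ["sc ; Cakm ; Chakma", "sc ; Takr ; Takri"]
    print(iso_only_codes(["Cakm", "Khoj", "Tirh", "Hluw", "Takr"], pva))
```

Codes that show up in the result but are missing from the .py variable need to be added; codes in the variable that no longer show up can be removed.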

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
  - C++ branches/markus/uni61 at r30864 from trunk at r30843
  - Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
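Note on checking the test file diffs: one way to spot data lines that were commented out in the checked-in test file but come back active in the freshly generated one is sketched below. This is a minimal Python sketch under two assumptions: known-failing lines were disabled by prefixing them with "#", and the re-enabled line is otherwise identical. The function name is hypothetical and the sketch is not part of the ICU tools.

```python
def reactivated_lines(old_lines, new_lines, comment_prefix="#"):
    """Return active lines of new_lines that were commented out in old_lines."""
    commented = set()
    for line in old_lines:
        stripped = line.strip()
        if stripped.startswith(comment_prefix):
            body = stripped[len(comment_prefix):].strip()
            if body:
                commented.add(body)
    active = [
        line.strip()
        for line in new_lines
        if line.strip() and not line.strip().startswith(comment_prefix)
    ]
    # Keep the order of the new file so the report follows the diff.
    return [line for line in active if line in commented]


if __name__ == "__main__":
    old = ["0041 0042", "#0061 0301", "# just a comment"]
    new = ["0041 0042", "0061 0301", "# just a comment"]
    for line in reactivated_lines(old, new):
        print("commented out before, active now:", line)
```

Each reported line is a candidate for being commented out again before committing the new test file.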

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
    # Arabic Extended-A: U+08A0 - U+08FF (was default-R)
    # Arabic Mathematical Alphabetic Symbols:
    # U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
    # Meroitic Hieroglyphs:
    # U+10980 - U+1099F
    # Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
    # Each line has two fields
    # First field: Code point
    # Second field: Alias
- to
    # Each line has three fields, as described here:
    #
    # First field: Code point
    # Second field: Alias
    # Third field: Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
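Note on the NameAliases.txt change above: with the three-field format, restricting a parser to "correction" aliases is a filter on the third field. The following is a minimal Python sketch of that filter, not the actual gennames C code. The function name is hypothetical, and the sample line for U+01A2 is quoted from memory of the Unicode data file rather than from this log.

```python
def correction_aliases(lines):
    """Map code point -> list of aliases whose type is 'correction'."""
    result = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            continue  # not in the three-field format described in the log
        code, alias, alias_type = fields
        if alias_type == "correction":
            result.setdefault(int(code, 16), []).append(alias)
    return result


if __name__ == "__main__":
    sample = [
        "# Each line has three fields",
        "01A2;LATIN CAPITAL LETTER GHA;correction",
        "FEFF;BYTE ORDER MARK;alternate",
        "FEFF;BOM;abbreviation",
        "FEFF;ZWNBSP;abbreviation",
    ]
    for code, aliases in correction_aliases(sample).items():
        print(f"U+{code:04X}", aliases)
```

Storing a list per code point keeps the sketch correct even if a code point ever gets more than one alias of the same type.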
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 5529 - rebuild ICU & tools 5530 5531 * update Java data files 5532 - refresh just the UCD-related files, just to be safe 5533 - see (ICU4C)/source/data/icu4j-readme.txt 5534 - mkdir /tmp/icu4j 5535 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5536 output: 5537 ... 5538 Unicode .icu files built to ./out/build/icudt49l 5539 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b 5540 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b 5541 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 5542 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b 5543 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b" 5544 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/ 5545 mkdir -p /tmp/icu4j/main/shared/data 5546 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 5547 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/ 5548 mkdir -p /tmp/icu4j/main/shared/data 5549 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 5550 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data' 5551 - copy the big-endian Unicode data files to another location, 5552 separate from the other data files 5553 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 5554 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr 5555 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b 5556 
~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu 5557 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b 5558 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 5559 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr 5560 - refresh ICU4J 5561 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b 5562 5563 * refresh Java test .txt files 5564 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 5565 5566 * test ICU so far, fix test code where necessary 5567 - temporarily ignore collation issues that look like UCA/UCD mismatches, 5568 until UCA data is updated 5569 5570 * UCA 5571 5572 - get output from Mark's tools; look in 5573 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. 
version>.txt 5574 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 5575 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 5576 (note removing the underscore before "Rules") 5577 - update (ICU)/source/test/testdata/CollationTest_*.txt 5578 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 5579 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 5580 - check test file diffs for previously commented-out, known-failing data lines; 5581 probably need to keep those commented out 5582 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 5583 - run makeuca.sh: 5584 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 5585 - rebuild ICU4C 5586 - refresh ICU4J collation data: 5587 (subset of instructions above for properties data refresh, except copies all coll/*) 5588 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5589 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 5590 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 5591 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b 5592 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 5593 - note on intltest: if collate/UCAConformanceTest fails, then 5594 utility/MultithreadTest/TestCollators will fail as well; 5595 fix the conformance test before looking into the multi-thread test 5596 5597 * When refreshing all of ICU4J data from ICU4C 5598 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5599 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 5600 or 5601 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 5602 5603 *** LayoutEngine script 
information 5604 5605 (For details see the Unicode 5.2 change log below.) 5606 5607 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 5608 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 5609 in the working directory. 5610 (It also generates ScriptRunData.cpp, which is no longer needed.) 5611 5612 The generated files have a current copyright date and "@draft" statement. 5613 5614 - diff current <icu>/source/layout files vs. generated ones 5615 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout 5616 review and manually merge desired changes; 5617 fix gratuitous changes, incorrect @draft and missing aliases; 5618 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. 5619 - if you just copy the above files, then 5620 fix mixed line endings, review the diffs as above and restore changes to API tags etc.; 5621 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h 5622 5623 *** merge the Unicode update branches back onto the trunk 5624 - do not merge the icudata.jar and testdata.jar, 5625 instead rebuild them from merged & tested ICU4C 5626 5627 ---------------------------------------------------------------------------- *** 5628 5629 ICU 4.8 (no Unicode update, just new script codes) 5630 5631 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html 5632 (added 2010-12-21) 5633 Afak 439 Afaka 5634 Jurc 510 Jurchen 5635 Mroo 199 Mro, Mru 5636 Nshu 499 Nüshu 5637 Shrd 319 Sharada, Śāradā 5638 Sora 398 Sora Sompeng 5639 Takr 321 Takri, Ṭākrī, Ṭāṅkrī 5640 Tang 520 Tangut 5641 Wole 480 Woleai 5642 -> uscript.h 5643 -> com.ibm.icu.lang.UScript 5644 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 5645 replace public static final int \1 = \2;\3 5646 -> genpname/SyntheticPropertyValueAliases.txt 5647 -> add to expectedLong and expectedShort names in 
cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
    ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
     scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
  Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
   to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
    http://www.macchiato.com/unicode/utc/additional-uca-files
    http://www.unicode.org/Public/UCA/6.0.0/
    http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
  or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
    moved from numeric to enumerated:
      ccc ; Canonical_Combining_Class
    new string properties:
      NFKC_CF ; NFKC_Casefold
      Name_Alias; Name_Alias
    new binary properties:
      Cased ; Cased
      CI ; Case_Ignorable
      CWCF ; Changes_When_Casefolded
      CWCM ; Changes_When_Casemapped
      CWKCF ; Changes_When_NFKC_Casefolded
      CWL ; Changes_When_Lowercased
      CWT ; Changes_When_Titlecased
      CWU ; Changes_When_Uppercased
    new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
    new block names
    new scripts
    one script code change:
      sc ; Qaai ; Inherited
      ->
      sc ; Zinh ; Inherited ; Qaai
    new Line_Break (lb) value:
      lb ; CP ; Close_Parenthesis
    new Joining_Group (jg) values: Farsi_Yeh, Nya
    other new values:
      ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
    new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
    all of the ISO comments are gone
    new CJK block end:
      9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
    new CJK block:
      2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
      2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 26 new blocks
    copy new blocks from Blocks.txt
    MS VC++ 2008 regular expression:
      find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
      replace with " UBLOCK_\3 = 172, /*[\1]*/"
  + several new script values already added in ICU 4.0 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
    (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
  + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.
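
The Blocks.txt step listed above under "uchar.h & uscript.h & uprops.h & uprops.c & genprops"
(the MS VC++ 2008 find/replace, where {...} marks a tagged expression) could also be scripted.
The following is only a sketch, not part of the ICU tools: the function name block_enum_lines
and its first_value parameter are hypothetical, and the rule for deriving the constant name
(upper-case the block name, turn spaces and hyphens into underscores) is an assumption; the
regex replacement in the log leaves the name as found in Blocks.txt. Unlike the replacement
text, which writes 172 for every entry, the sketch counts up from first_value.

```python
import re

# Matches a Blocks.txt data line such as "A960..A97F; Hangul Jamo Extended-A".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_enum_lines(lines, first_value):
    """Turn Blocks.txt data lines into UBLOCK_ enum entries numbered from first_value."""
    result = []
    value = first_value
    for line in lines:
        match = BLOCK_LINE.match(line.strip())
        if match is None:
            continue  # skip comments, blank lines and anything unexpected
        start, _end, name = match.groups()
        # Assumption: constant names are the upper-cased block name with
        # spaces and hyphens replaced by underscores.
        constant = re.sub(r"[ -]", "_", name.upper())
        result.append(f"    UBLOCK_{constant} = {value}, /*[{start}]*/")
        value += 1
    return result


if __name__ == "__main__":
    sample = [
        "# comment lines are ignored",
        "A960..A97F; Hangul Jamo Extended-A",
        "D7B0..D7FF; Hangul Jamo Extended-B",
    ]
    # 172 is the value that appears in the replacement text in the log.
    for entry in block_enum_lines(sample, 172):
        print(entry)
```

The generated entries still need a manual review against uchar.h, in particular the numbering
and any block whose constant name does not follow the assumed naming rule.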

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
  and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
    C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

    ICU data make path is \svn\icuproj\icu\trunk\source\data\
    ICU root path is \svn\icuproj\icu\trunk
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
    Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
    Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
    Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
    Creating data file for Unicode Property Names
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
    Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
  [ Begin obsolete instructions:
  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
- generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
  on Windows:
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
       find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
       replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

  -> Eric Mader wrote in email on 20090930:
     "I think the tool has been modified to update @draft to @stable for
     older scripts and to add @draft for new scripts.
     (I worked with an intern on this last year.)
     You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

  Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
  and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

  -> Eric Mader wrote in email on 20090930:
     "This is just a matter of making sure that all the per-script tables have
     entries for any new scripts that were added.
     If any new Indic characters were added, then the class tables in
     IndicClassTables.cpp should be updated to reflect this.
     John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
    N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.
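
The duplicate script entries that preparse.pl complains about above could be listed before
running it. The following is only a sketch, not part of the ICU tools: the function names
are hypothetical, and it assumes that SyntheticPropertyValueAliases.txt uses the same
"sc ; Code ; Long_Name" line format as PropertyValueAliases.txt and that "#" starts a comment.

```python
def script_codes(lines):
    """Collect the short script codes from 'sc ; Code ; Long_Name' lines."""
    codes = set()
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def duplicate_script_codes(ucd_lines, synthetic_lines):
    """Return the script codes that are defined in both files, sorted."""
    return sorted(script_codes(ucd_lines) & script_codes(synthetic_lines))


if __name__ == "__main__":
    # File names as used in this log; run from the directory that holds both files.
    with open("PropertyValueAliases.txt", encoding="utf-8") as ucd, \
            open("SyntheticPropertyValueAliases.txt", encoding="utf-8") as synthetic:
        for code in duplicate_script_codes(ucd, synthetic):
            print("remove from SyntheticPropertyValueAliases.txt:", code)
```

Each code that is printed is a candidate for removal from SyntheticPropertyValueAliases.txt;
preparse.pl remains the authority on whether the two files conflict.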

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  None of the codes above 127 is yet the script code of an
  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
    C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

    ICU data make path is \svn\icuproj\icu\uni51\source\data\
    ICU root path is \svn\icuproj\icu\uni51
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
    Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
    Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
    Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
    Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

  Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
  and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084    RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
    copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
    copy 5.0.0\ucd\Blocks.txt ..\unidata\
    copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
    copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
    copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
    copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
    copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
    copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
    copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
    copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

    ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
    ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
    ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
    ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
    ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
    ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
    ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
    ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
    ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
    ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
    C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

    ICU data make path is \cvs\oss\icu\source\data\
    ICU root path is \cvs\oss\icu
    Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
    [etc.]
    Creating data file for Unicode Character Properties
    Creating data file for Unicode Case Mapping Properties
    Creating data file for Unicode BiDi/Shaping Properties
    Creating data file for Unicode Normalization
    Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
    Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (it also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

  Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
  and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332    RFE: Update to Unicode 4.1
4157    RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

    source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
    source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
    source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170    RFE: Update to Unicode 4.0.1
3171    Add new Unicode 4.0.1 properties
3520    use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
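
The DerivedNormalizationProps.txt change tracked in gennorm.c above can be sketched as a small reader that accepts both spellings and reports them in the newer form. This is illustrative Python, not the gennorm.c parser. The helper name parse_norm_prop and the sample lines are hypothetical, and handling of "_MAYBE" names (mapped to value "M") is an assumption by analogy with the "etc." above.

```python
# Hypothetical helper: normalize old- and new-style DerivedNormalizationProps lines.
def parse_norm_prop(line: str):
    """Return (code_points, property, value), or None for blank/comment lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 2:
        raise ValueError(f"expected at least two fields: {line!r}")
    code_points, name = fields[0], fields[1]
    value = fields[2] if len(fields) > 2 else None
    if name == "FNC":                    # old name -> "FC_NFKC"
        name = "FC_NFKC"
    elif name.endswith("_NO"):           # "NFD_NO" -> "NFD_QC; N"
        name, value = name[:-3] + "_QC", "N"
    elif name.endswith("_MAYBE"):        # assumption: "NFC_MAYBE" -> "NFC_QC; M"
        name, value = name[:-6] + "_QC", "M"
    return code_points, name, value


if __name__ == "__main__":
    samples = [
        "00C0..00C5 ; NFD_NO # sample old-style line",
        "00C0..00C5 ; NFD_QC; N # sample new-style line",
        "037A ; FNC; 0020 03B9 # sample old-style mapping",
    ]
    for s in samples:
        print(parse_norm_prop(s))
```

Both spellings of the same record produce the same tuple, which is the property a parser needs while the data file and the tool are being updated in separate steps.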

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
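
For the genprops/props2.c item earlier in this section (DerivedNumericValues.txt going from 3 columns to 2), the stated assumption can be sketched as follows: with no type column left, every record is given the numeric type by default. This is illustrative Python, not props2.c. The helper name parse_numeric_line, the sample lines, and the column order (code point or range first, value second) are assumptions, and the value is kept as a string rather than interpreted.

```python
# Hypothetical helper: read a two-column DerivedNumericValues-style line.
def parse_numeric_line(line: str, default_type: str = "numeric"):
    """Return (start, end, value_text, numeric_type), or None for blank/comment lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) != 2:
        raise ValueError(f"expected exactly two fields: {line!r}")
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    # No type column any more: fall back to the assumed default type.
    return start, end, fields[1], default_type


if __name__ == "__main__":
    for sample in ("4E00 ; 1 # sample line", "0030..0031 ; 0 # sample range"):
        print(parse_numeric_line(sample))
```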