.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" uconv.1: manual page for the uconv utility.
.\"
.\" Copyright (C) 2016 and later: Unicode, Inc. and others.
.\" License & terms of use: http://www.unicode.org/copyright.html
.\" Copyright (C) 2000-2013 IBM, Inc. and others.
.\"
.\" Manual page by Yves Arrouye <yves@realnames.com>.
.\"
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B uconv
\- convert data from one encoding to another
.SH SYNOPSIS
.B uconv
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BI "\-V\fP, \fB\-\-version"
]
[
.BI "\-s\fP, \fB\-\-silent"
]
[
.BI "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-l\fP, \fB\-\-list"
|
.BI "\-l\fP, \fB\-\-list\-code" " code"
|
.BI "\-\-default-code"
|
.BI "\-L\fP, \fB\-\-list\-transliterators"
]
[
.BI "\-\-canon"
]
[
.BI "\-x" " transliteration"
]
[
.BI "\-\-to\-callback" " callback"
|
.B "\-c"
]
[
.BI "\-\-from\-callback" " callback"
|
.B "\-i"
]
[
.BI "\-\-callback" " callback"
]
[
.BI "\-\-fallback"
|
.BI "\-\-no\-fallback"
]
[
.BI "\-b\fP, \fB\-\-block\-size" " size"
]
[
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
]
[
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
]
[
.BI "\-\-add\-signature"
]
[
.BI "\-\-remove\-signature"
]
[
.BI "\-o\fP, \fB\-\-output" " file"
]
[
.IR file .\|.\|.
]
.SH DESCRIPTION
.B uconv
converts, or transcodes, each given
.I file
(or its standard input if no
.I file
is specified) from one
.I encoding
to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
.PP
If an
.I encoding
is not specified or is
.BR - ,
the default encoding is used.
Thus, calling
.B uconv
with no
.I encoding
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
.PP
When calling
.BR uconv ,
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
.PP
.B uconv
can also run the specified
.IR transliteration
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
The
.I transliteration
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
.PP
For transcoding purposes,
.B uconv
options are compatible with those of
.BR iconv (1),
so it can easily replace that utility in scripts. It is not necessarily the case,
however, that the encoding names used by
.B uconv
and ICU are the same as the ones used by
.BR iconv (1).
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
one offered by some
.BR iconv (1)
variants such as GNU's, produce data in a slightly different and
easier to parse format.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B uconv
and exit.
.TP
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.TP
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.TP
.BI "\-l\fP, \fB\-\-list\-code" " code"
List only the
.I code
encoding and exit. If
.I code
is not a proper encoding, exit with an error.
.TP
.BI "\-\-default-code"
List only the name of the default encoding and exit.
.TP
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.TP
.BI "\-\-canon"
If used with
.BI "\-l\fP, \fB\-\-list"
or
.BR "\-\-default-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
If used with
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.TP
.BI "\-x" " transliteration"
Run the given
.IR transliteration
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.TP
.BI "\-\-to\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-c"
Omit invalid characters from the output.
Same as
.BR "\-\-to\-callback skip" .
.TP
.BI "\-\-from\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded from the original
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-i"
Ignore invalid sequences in the input.
Same as
.BR "\-\-from\-callback skip" .
.TP
.BI "\-\-callback" " callback"
Use
.I callback
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.BI "\-\-fallback"
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.TP
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
.TP
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
.I size
bytes at a time. The default block size is
4096.
.TP
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.IR encoding .
.TP
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.IR encoding .
.TP
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not add one anyway.
.TP
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.TP
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
.IR file .
.SH CALLBACKS
.B uconv
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
.BR "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.BR "\-\-to\-callback"
option.
.PP
The following is a list of valid
.I callback
names, along with a description of their behavior. The list of
callbacks actually supported by
.B uconv
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.PP
.TP \w'\fBescape-unicode'u+3n
.B substitute
Write the encoding's substitute sequence, or the Unicode
replacement character
.B U+FFFD
when transcoding to Unicode.
.TP
.B skip
Ignore the invalid data.
.TP
.B stop
Stop with an error when encountering invalid data.
This is the default callback.
.TP
.B escape
Same as
.BR escape-icu .
.TP
.B escape-icu
Replace the missing characters with a string of the format
.BR %U\fIhhhh\fP
for plane 0 characters, and
.BR %U\fIhhhh\fP%U\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-java
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eu\fIhhhh\fP\eu\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-c
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eU\fIhhhhhhhh\fP
for planes 1 and above characters,
where
.I hhhh
and
.I hhhhhhhh
are the hexadecimal values of the Unicode codepoint.
.TP
.B escape-xml
Same as
.BR escape-xml-hex .
.TP
.B escape-xml-hex
Replace the missing characters with a string of the format
.BR &#x\fIhhhh\fP; ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
.TP
.B escape-xml-dec
Replace the missing characters with a string of the format
.BR &#\fInnnn\fP; ,
where
.I nnnn
is the decimal value of the Unicode codepoint.
.TP
.B escape-unicode
Replace the missing characters with a string of the format
.BR {U+\fIhhhh\fP} ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the format universally used to denote a Unicode codepoint in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
.SH EXAMPLES
Convert data from a given
.I encoding
to the platform encoding:

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP
.RE
.PP
Check if a
.I file
contains valid data for a given
.IR encoding :

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
.RE
.PP
Convert a UTF-8
.I file
to a given
.I encoding
and ensure that the resulting text is good for any version of HTML:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.br
.B " \-\-callback escape-xml-dec \fIfile\fP"
.RE
.PP
Display the names of the Unicode code points in a UTF-8 file:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
.RE
.PP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
in this example):

.RS 4
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
.br
{KATAKANA LETTER KA}{LINE FEED}
.br
$
.RE

(The names are delimited by curly braces.
The name of the line terminator is also displayed.)
.PP
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.br
.B " \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
.RE
.SH CAVEATS AND BUGS
.B uconv
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
.BR iconv (1),
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
.BR uconv
error positions may be at a later offset in the input stream than
would be the case with GNU
.BR iconv (1).
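.PP
As a rough illustration of the two conventions, the following Python sketch
uses Python's own UTF-8 decoder, not
.B uconv
or
.BR iconv (1).
It assumes that the "first invalid byte" is the byte at which a decoder
detects that a multi-byte sequence cannot be completed; the sample bytes and
variable names are made up for the example.
.PP
.RS 4
.nf
```python
# Illustration only: this exercises Python's decoder, not uconv or iconv.
# Sample input: "a", "b", then E3 81 (a 3-byte UTF-8 sequence missing its
# last byte), then "z".
data = bytes.fromhex("6162e3817a")

offsets = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start: offset of the first byte of the invalid sequence
    #            (the convention attributed to GNU iconv above).
    # err.end:   offset of the first byte that cannot continue the sequence
    #            (ASSUMED to match the "first invalid byte" convention).
    offsets = (err.start, err.end)

print(offsets)
```
.fi
.RE
.PP
In this sketch the start offset points at the lead byte of the truncated
sequence and the end offset points at a later byte, which is the kind of
difference in reported positions described above.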
.PP
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
.BR uconv
will report the offset in the output stream at which the error
occurred.
.SH AUTHORS
Jonas Utterstroem
.br
Yves Arrouye
.SH VERSION
@VERSION@
.SH COPYRIGHT
Copyright (C) 2000-2005 IBM, Inc. and others.
.SH SEE ALSO
.BR iconv (1)