appnote.txt (46890B)
1 Revised: 03/01/1999 2 3 Disclaimer 4 ---------- 5 6 Although PKWARE will attempt to supply current and accurate 7 information relating to its file formats, algorithms, and the 8 subject programs, the possibility of error can not be eliminated. 9 PKWARE therefore expressly disclaims any warranty that the 10 information contained in the associated materials relating to the 11 subject programs and/or the format of the files created or 12 accessed by the subject programs and/or the algorithms used by 13 the subject programs, or any other matter, is current, correct or 14 accurate as delivered. Any risk of damage due to any possible 15 inaccurate information is assumed by the user of the information. 16 Furthermore, the information relating to the subject programs 17 and/or the file formats created or accessed by the subject 18 programs and/or the algorithms used by the subject programs is 19 subject to change without notice. 20 21 General Format of a ZIP file 22 ---------------------------- 23 24 Files stored in arbitrary order. Large zipfiles can span multiple 25 diskette media. 26 27 Overall zipfile format: 28 29 [local file header + file data + data_descriptor] . . . 30 [central directory] end of central directory record 31 32 33 A. Local file header: 34 35 local file header signature 4 bytes (0x04034b50) 36 version needed to extract 2 bytes 37 general purpose bit flag 2 bytes 38 compression method 2 bytes 39 last mod file time 2 bytes 40 last mod file date 2 bytes 41 crc-32 4 bytes 42 compressed size 4 bytes 43 uncompressed size 4 bytes 44 filename length 2 bytes 45 extra field length 2 bytes 46 47 filename (variable size) 48 extra field (variable size) 49 50 B. Data descriptor: 51 52 crc-32 4 bytes 53 compressed size 4 bytes 54 uncompressed size 4 bytes 55 56 This descriptor exists only if bit 3 of the general 57 purpose bit flag is set (see below). It is byte aligned 58 and immediately follows the last byte of compressed data. 59 This descriptor is used only when it was not possible to 60 seek in the output zip file, e.g., when the output zip file 61 was standard output or a non seekable device. 62 63 C. Central directory structure: 64 65 [file header] . . . end of central dir record 66 67 File header: 68 69 central file header signature 4 bytes (0x02014b50) 70 version made by 2 bytes 71 version needed to extract 2 bytes 72 general purpose bit flag 2 bytes 73 compression method 2 bytes 74 last mod file time 2 bytes 75 last mod file date 2 bytes 76 crc-32 4 bytes 77 compressed size 4 bytes 78 uncompressed size 4 bytes 79 filename length 2 bytes 80 extra field length 2 bytes 81 file comment length 2 bytes 82 disk number start 2 bytes 83 internal file attributes 2 bytes 84 external file attributes 4 bytes 85 relative offset of local header 4 bytes 86 87 filename (variable size) 88 extra field (variable size) 89 file comment (variable size) 90 91 End of central dir record: 92 93 end of central dir signature 4 bytes (0x06054b50) 94 number of this disk 2 bytes 95 number of the disk with the 96 start of the central directory 2 bytes 97 total number of entries in 98 the central dir on this disk 2 bytes 99 total number of entries in 100 the central dir 2 bytes 101 size of the central directory 4 bytes 102 offset of start of central 103 directory with respect to 104 the starting disk number 4 bytes 105 zipfile comment length 2 bytes 106 zipfile comment (variable size) 107 108 D. Explanation of fields: 109 110 version made by (2 bytes) 111 112 The upper byte indicates the compatibility of the file 113 attribute information. If the external file attributes 114 are compatible with MS-DOS and can be read by PKZIP for 115 DOS version 2.04g then this value will be zero. If these 116 attributes are not compatible, then this value will 117 identify the host system on which the attributes are 118 compatible. Software can use this information to determine 119 the line record format for text files etc. The current 120 mappings are: 121 122 0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems) 123 1 - Amiga 2 - VAX/VMS 124 3 - Unix 4 - VM/CMS 125 5 - Atari ST 6 - OS/2 H.P.F.S. 126 7 - Macintosh 8 - Z-System 127 9 - CP/M 10 - Windows NTFS 128 11 thru 255 - unused 129 130 The lower byte indicates the version number of the 131 software used to encode the file. The value/10 132 indicates the major version number, and the value 133 mod 10 is the minor version number. 134 135 version needed to extract (2 bytes) 136 137 The minimum software version needed to extract the 138 file, mapped as above. 139 140 general purpose bit flag: (2 bytes) 141 142 Bit 0: If set, indicates that the file is encrypted. 143 144 (For Method 6 - Imploding) 145 Bit 1: If the compression method used was type 6, 146 Imploding, then this bit, if set, indicates 147 an 8K sliding dictionary was used. If clear, 148 then a 4K sliding dictionary was used. 149 Bit 2: If the compression method used was type 6, 150 Imploding, then this bit, if set, indicates 151 3 Shannon-Fano trees were used to encode the 152 sliding dictionary output. If clear, then 2 153 Shannon-Fano trees were used. 154 155 (For Method 8 - Deflating) 156 Bit 2 Bit 1 157 0 0 Normal (-en) compression option was used. 158 0 1 Maximum (-ex) compression option was used. 159 1 0 Fast (-ef) compression option was used. 160 1 1 Super Fast (-es) compression option was used. 161 162 Note: Bits 1 and 2 are undefined if the compression 163 method is any other. 164 165 Bit 3: If this bit is set, the fields crc-32, compressed 166 size and uncompressed size are set to zero in the 167 local header. The correct values are put in the 168 data descriptor immediately following the compressed 169 data. (Note: PKZIP version 2.04g for DOS only 170 recognizes this bit for method 8 compression, newer 171 versions of PKZIP recognize this bit for any 172 compression method.) 173 174 Bit 4: Reserved for use with method 8, for enhanced 175 deflating. 176 177 Bit 5: If this bit is set, this indicates that the file is 178 compressed patched data. (Note: Requires PKZIP 179 version 2.70 or greater) 180 181 Bit 6: Currently unused. 182 183 Bit 7: Currently unused. 184 185 Bit 8: Currently unused. 186 187 Bit 9: Currently unused. 188 189 Bit 10: Currently unused. 190 191 Bit 11: Currently unused. 192 193 Bit 12: Reserved by PKWARE for enhanced compression. 194 195 Bit 13: Reserved by PKWARE. 196 197 Bit 14: Reserved by PKWARE. 198 199 Bit 15: Reserved by PKWARE. 200 201 compression method: (2 bytes) 202 203 (see accompanying documentation for algorithm 204 descriptions) 205 206 0 - The file is stored (no compression) 207 1 - The file is Shrunk 208 2 - The file is Reduced with compression factor 1 209 3 - The file is Reduced with compression factor 2 210 4 - The file is Reduced with compression factor 3 211 5 - The file is Reduced with compression factor 4 212 6 - The file is Imploded 213 7 - Reserved for Tokenizing compression algorithm 214 8 - The file is Deflated 215 9 - Reserved for enhanced Deflating 216 10 - PKWARE Date Compression Library Imploding 217 218 date and time fields: (2 bytes each) 219 220 The date and time are encoded in standard MS-DOS format. 221 If input came from standard input, the date and time are 222 those at which compression was started for this data. 223 224 CRC-32: (4 bytes) 225 226 The CRC-32 algorithm was generously contributed by 227 David Schwaderer and can be found in his excellent 228 book "C Programmers Guide to NetBIOS" published by 229 Howard W. Sams & Co. Inc. The 'magic number' for 230 the CRC is 0xdebb20e3. The proper CRC pre and post 231 conditioning is used, meaning that the CRC register 232 is pre-conditioned with all ones (a starting value 233 of 0xffffffff) and the value is post-conditioned by 234 taking the one's complement of the CRC residual. 235 If bit 3 of the general purpose flag is set, this 236 field is set to zero in the local header and the correct 237 value is put in the data descriptor and in the central 238 directory. 239 240 compressed size: (4 bytes) 241 uncompressed size: (4 bytes) 242 243 The size of the file compressed and uncompressed, 244 respectively. If bit 3 of the general purpose bit flag 245 is set, these fields are set to zero in the local header 246 and the correct values are put in the data descriptor and 247 in the central directory. 248 249 filename length: (2 bytes) 250 extra field length: (2 bytes) 251 file comment length: (2 bytes) 252 253 The length of the filename, extra field, and comment 254 fields respectively. The combined length of any 255 directory record and these three fields should not 256 generally exceed 65,535 bytes. If input came from standard 257 input, the filename length is set to zero. 258 259 disk number start: (2 bytes) 260 261 The number of the disk on which this file begins. 262 263 internal file attributes: (2 bytes) 264 265 The lowest bit of this field indicates, if set, that 266 the file is apparently an ASCII or text file. If not 267 set, that the file apparently contains binary data. 268 The remaining bits are unused in version 1.0. 269 270 Bits 1 and 2 are reserved for use by PKWARE. 271 272 external file attributes: (4 bytes) 273 274 The mapping of the external attributes is 275 host-system dependent (see 'version made by'). For 276 MS-DOS, the low order byte is the MS-DOS directory 277 attribute byte. If input came from standard input, this 278 field is set to zero. 279 280 relative offset of local header: (4 bytes) 281 282 This is the offset from the start of the first disk on 283 which this file appears, to where the local header should 284 be found. 285 286 filename: (Variable) 287 288 The name of the file, with optional relative path. 289 The path stored should not contain a drive or 290 device letter, or a leading slash. All slashes 291 should be forward slashes '/' as opposed to 292 backwards slashes '\' for compatibility with Amiga 293 and Unix file systems etc. If input came from standard 294 input, there is no filename field. 295 296 extra field: (Variable) 297 298 This is for future expansion. If additional information 299 needs to be stored in the future, it should be stored 300 here. Earlier versions of the software can then safely 301 skip this file, and find the next file or header. This 302 field will be 0 length in version 1.0. 303 304 In order to allow different programs and different types 305 of information to be stored in the 'extra' field in .ZIP 306 files, the following structure should be used for all 307 programs storing data in this field: 308 309 header1+data1 + header2+data2 . . . 310 311 Each header should consist of: 312 313 Header ID - 2 bytes 314 Data Size - 2 bytes 315 316 Note: all fields stored in Intel low-byte/high-byte order. 317 318 The Header ID field indicates the type of data that is in 319 the following data block. 320 321 Header ID's of 0 thru 31 are reserved for use by PKWARE. 322 The remaining ID's can be used by third party vendors for 323 proprietary usage. 324 325 The current Header ID mappings defined by PKWARE are: 326 327 0x0007 AV Info 328 0x0009 OS/2 329 0x000a NTFS 330 0x000c VAX/VMS 331 0x000d Unix 332 0x000f Patch Descriptor 333 334 Several third party mappings commonly used are: 335 336 0x4b46 FWKCS MD5 (see below) 337 0x07c8 Macintosh 338 0x4341 Acorn/SparkFS 339 0x4453 Windows NT security descriptor (binary ACL) 340 0x4704 VM/CMS 341 0x470f MVS 342 0x4c41 OS/2 access control list (text ACL) 343 0x4d49 Info-ZIP VMS (VAX or Alpha) 344 0x5455 extended timestamp 345 0x5855 Info-ZIP Unix (original, also OS/2, NT, etc) 346 0x6542 BeOS/BeBox 347 0x756e ASi Unix 348 0x7855 Info-ZIP Unix (new) 349 0xfd4a SMS/QDOS 350 351 The Data Size field indicates the size of the following 352 data block. Programs can use this value to skip to the 353 next header block, passing over any data blocks that are 354 not of interest. 355 356 Note: As stated above, the size of the entire .ZIP file 357 header, including the filename, comment, and extra 358 field should not exceed 64K in size. 359 360 In case two different programs should appropriate the same 361 Header ID value, it is strongly recommended that each 362 program place a unique signature of at least two bytes in 363 size (and preferably 4 bytes or bigger) at the start of 364 each data area. Every program should verify that its 365 unique signature is present, in addition to the Header ID 366 value being correct, before assuming that it is a block of 367 known type. 368 369 -OS/2 Extra Field: 370 371 The following is the layout of the OS/2 attributes "extra" 372 block. (Last Revision 09/05/95) 373 374 Note: all fields stored in Intel low-byte/high-byte order. 375 376 Value Size Description 377 ----- ---- ----------- 378 (OS/2) 0x0009 2 bytes Tag for this "extra" block type 379 TSize 2 bytes Size for the following data block 380 BSize 4 bytes Uncompressed Block Size 381 CType 2 bytes Compression type 382 EACRC 4 bytes CRC value for uncompress block 383 (var) variable Compressed block 384 385 The OS/2 extended attribute structure (FEA2LIST) is 386 compressed and then stored in it's entirety within this 387 structure. There will only ever be one "block" of data in 388 VarFields[]. 389 390 -UNIX Extra Field: 391 392 The following is the layout of the Unix "extra" block. 393 Note: all fields are stored in Intel low-byte/high-byte 394 order. 395 396 Value Size Description 397 ----- ---- ----------- 398 (UNIX) 0x000d 2 bytes Tag for this "extra" block type 399 TSize 2 bytes Size for the following data block 400 Atime 4 bytes File last access time 401 Mtime 4 bytes File last modification time 402 Uid 2 bytes File user ID 403 Gid 2 bytes File group ID 404 (var) variable Variable length data field 405 406 The variable length data field will contain file type 407 specific data. Currently the only values allowed are 408 the original "linked to" file names for hard or symbolic 409 links. 410 411 -VAX/VMS Extra Field: 412 413 The following is the layout of the VAX/VMS attributes 414 "extra" block. 415 416 Note: all fields stored in Intel low-byte/high-byte order. 417 418 Value Size Description 419 ----- ---- ----------- 420 (VMS) 0x000c 2 bytes Tag for this "extra" block type 421 TSize 2 bytes Size of the total "extra" block 422 CRC 4 bytes 32-bit CRC for remainder of the block 423 Tag1 2 bytes VMS attribute tag value #1 424 Size1 2 bytes Size of attribute #1, in bytes 425 (var.) Size1 Attribute #1 data 426 . 427 . 428 . 429 TagN 2 bytes VMS attribute tage value #N 430 SizeN 2 bytes Size of attribute #N, in bytes 431 (var.) SizeN Attribute #N data 432 433 Rules: 434 435 1. There will be one or more of attributes present, which 436 will each be preceded by the above TagX & SizeX values. 437 These values are identical to the ATR$C_XXXX and 438 ATR$S_XXXX constants which are defined in ATR.H under 439 VMS C. Neither of these values will ever be zero. 440 441 2. No word alignment or padding is performed. 442 443 3. A well-behaved PKZIP/VMS program should never produce 444 more than one sub-block with the same TagX value. Also, 445 there will never be more than one "extra" block of type 446 0x000c in a particular directory record. 447 448 -NTFS Extra Field: 449 450 The following is the layout of the NTFS attributes 451 "extra" block. 452 453 Note: all fields stored in Intel low-byte/high-byte order. 454 455 Value Size Description 456 ----- ---- ----------- 457 (NTFS) 0x000a 2 bytes Tag for this "extra" block type 458 TSize 2 bytes Size of the total "extra" block 459 Reserved 4 bytes Reserved for future use 460 Tag1 2 bytes NTFS attribute tag value #1 461 Size1 2 bytes Size of attribute #1, in bytes 462 (var.) Size1 Attribute #1 data 463 . 464 . 465 . 466 TagN 2 bytes NTFS attribute tage value #N 467 SizeN 2 bytes Size of attribute #N, in bytes 468 (var.) SizeN Attribute #N data 469 470 For NTFS, values for Tag1 through TagN are as follows: 471 (currently only one set of attributes is defined for NTFS) 472 473 Tag Size Description 474 ----- ---- ----------- 475 0x0001 2 bytes Tag for attribute #1 476 Size1 2 bytes Size of attribute #1, in bytes 477 Mtime 8 bytes File last modification time 478 Atime 8 bytes File last access time 479 Ctime 8 bytes File creation time 480 481 -PATCH Descriptor Extra Field: 482 483 The following is the layout of the Patch Descriptor "extra" 484 block. 485 486 Note: all fields stored in Intel low-byte/high-byte order. 487 488 Value Size Description 489 ----- ---- ----------- 490 (Patch) 0x000f 2 bytes Tag for this "extra" block type 491 TSize 2 bytes Size of the total "extra" block 492 Version 2 bytes Version of the descriptor 493 Flags 4 bytes Actions and reactions (see below) 494 OldSize 4 bytes Size of the file about to be patched 495 OldCRC 4 bytes 32-bit CRC of the file to be patched 496 NewSize 4 bytes Size of the resulting file 497 NewCRC 4 bytes 32-bit CRC of the resulting file 498 499 Actions and reactions 500 501 Bits Description 502 ---- ---------------- 503 0 Use for autodetection 504 1 Treat as selfpatch 505 2-3 RESERVED 506 4-5 Action (see below) 507 6-7 RESERVED 508 8-9 Reaction (see below) to absent file 509 10-11 Reaction (see below) to newer file 510 12-13 Reaction (see below) to unknown file 511 14-15 RESERVED 512 16-31 RESERVED 513 514 Actions 515 516 Action Value 517 ------ ----- 518 none 0 519 add 1 520 delete 2 521 patch 3 522 523 Reactions 524 525 Reaction Value 526 -------- ----- 527 ask 0 528 skip 1 529 ignore 2 530 fail 3 531 532 - FWKCS MD5 Extra Field: 533 534 The FWKCS Contents_Signature System, used in 535 automatically identifying files independent of filename, 536 optionally adds and uses an extra field to support the 537 rapid creation of an enhanced contents_signature: 538 539 Header ID = 0x4b46 540 Data Size = 0x0013 541 Preface = 'M','D','5' 542 followed by 16 bytes containing the uncompressed file's 543 128_bit MD5 hash(1), low byte first. 544 545 When FWKCS revises a zipfile central directory to add 546 this extra field for a file, it also replaces the 547 central directory entry for that file's uncompressed 548 filelength with a measured value. 549 550 FWKCS provides an option to strip this extra field, if 551 present, from a zipfile central directory. In adding 552 this extra field, FWKCS preserves Zipfile Authenticity 553 Verification; if stripping this extra field, FWKCS 554 preserves all versions of AV through PKZIP version 2.04g. 555 556 FWKCS, and FWKCS Contents_Signature System, are 557 trademarks of Frederick W. Kantor. 558 559 (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer 560 Science and RSA Data Security, Inc., April 1992. 561 ll.76-77: "The MD5 algorithm is being placed in the 562 public domain for review and possible adoption as a 563 standard." 564 565 file comment: (Variable) 566 567 The comment for this file. 568 569 number of this disk: (2 bytes) 570 571 The number of this disk, which contains central 572 directory end record. 573 574 number of the disk with the start of the central 575 directory: (2 bytes) 576 577 The number of the disk on which the central 578 directory starts. 579 580 total number of entries in the central dir on 581 this disk: (2 bytes) 582 583 The number of central directory entries on this disk. 584 585 total number of entries in the central dir: (2 bytes) 586 587 The total number of files in the zipfile. 588 589 size of the central directory: (4 bytes) 590 591 The size (in bytes) of the entire central directory. 592 593 offset of start of central directory with respect to 594 the starting disk number: (4 bytes) 595 596 Offset of the start of the central directory on the 597 disk on which the central directory starts. 598 599 zipfile comment length: (2 bytes) 600 601 The length of the comment for this zipfile. 602 603 zipfile comment: (Variable) 604 605 The comment for this zipfile. 606 607 D. General notes: 608 609 1) All fields unless otherwise noted are unsigned and stored 610 in Intel low-byte:high-byte, low-word:high-word order. 611 612 2) String fields are not null terminated, since the 613 length is given explicitly. 614 615 3) Local headers should not span disk boundaries. Also, even 616 though the central directory can span disk boundaries, no 617 single record in the central directory should be split 618 across disks. 619 620 4) The entries in the central directory may not necessarily 621 be in the same order that files appear in the zipfile. 622 623 UnShrinking - Method 1 624 ---------------------- 625 626 Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm 627 with partial clearing. The initial code size is 9 bits, and 628 the maximum code size is 13 bits. Shrinking differs from 629 conventional Dynamic Ziv-Lempel-Welch implementations in several 630 respects: 631 632 1) The code size is controlled by the compressor, and is not 633 automatically increased when codes larger than the current 634 code size are created (but not necessarily used). When 635 the decompressor encounters the code sequence 256 636 (decimal) followed by 1, it should increase the code size 637 read from the input stream to the next bit size. No 638 blocking of the codes is performed, so the next code at 639 the increased size should be read from the input stream 640 immediately after where the previous code at the smaller 641 bit size was read. Again, the decompressor should not 642 increase the code size used until the sequence 256,1 is 643 encountered. 644 645 2) When the table becomes full, total clearing is not 646 performed. Rather, when the compressor emits the code 647 sequence 256,2 (decimal), the decompressor should clear 648 all leaf nodes from the Ziv-Lempel tree, and continue to 649 use the current code size. The nodes that are cleared 650 from the Ziv-Lempel tree are then re-used, with the lowest 651 code value re-used first, and the highest code value 652 re-used last. The compressor can emit the sequence 256,2 653 at any time. 654 655 Expanding - Methods 2-5 656 ----------------------- 657 658 The Reducing algorithm is actually a combination of two 659 distinct algorithms. The first algorithm compresses repeated 660 byte sequences, and the second algorithm takes the compressed 661 stream from the first algorithm and applies a probabilistic 662 compression method. 663 664 The probabilistic compression stores an array of 'follower 665 sets' S(j), for j=0 to 255, corresponding to each possible 666 ASCII character. Each set contains between 0 and 32 667 characters, to be denoted as S(j)[0],...,S(j)[m], where m<32. 668 The sets are stored at the beginning of the data area for a 669 Reduced file, in reverse order, with S(255) first, and S(0) 670 last. 671 672 The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] }, 673 where N(j) is the size of set S(j). N(j) can be 0, in which 674 case the follower set for S(j) is empty. Each N(j) value is 675 encoded in 6 bits, followed by N(j) eight bit character values 676 corresponding to S(j)[0] to S(j)[N(j)-1] respectively. If 677 N(j) is 0, then no values for S(j) are stored, and the value 678 for N(j-1) immediately follows. 679 680 Immediately after the follower sets, is the compressed data 681 stream. The compressed data stream can be interpreted for the 682 probabilistic decompression as follows: 683 684 let Last-Character <- 0. 685 loop until done 686 if the follower set S(Last-Character) is empty then 687 read 8 bits from the input stream, and copy this 688 value to the output stream. 689 otherwise if the follower set S(Last-Character) is non-empty then 690 read 1 bit from the input stream. 691 if this bit is not zero then 692 read 8 bits from the input stream, and copy this 693 value to the output stream. 694 otherwise if this bit is zero then 695 read B(N(Last-Character)) bits from the input 696 stream, and assign this value to I. 697 Copy the value of S(Last-Character)[I] to the 698 output stream. 699 700 assign the last value placed on the output stream to 701 Last-Character. 702 end loop 703 704 B(N(j)) is defined as the minimal number of bits required to 705 encode the value N(j)-1. 706 707 The decompressed stream from above can then be expanded to 708 re-create the original file as follows: 709 710 let State <- 0. 711 712 loop until done 713 read 8 bits from the input stream into C. 714 case State of 715 0: if C is not equal to DLE (144 decimal) then 716 copy C to the output stream. 717 otherwise if C is equal to DLE then 718 let State <- 1. 719 720 1: if C is non-zero then 721 let V <- C. 722 let Len <- L(V) 723 let State <- F(Len). 724 otherwise if C is zero then 725 copy the value 144 (decimal) to the output stream. 726 let State <- 0 727 728 2: let Len <- Len + C 729 let State <- 3. 730 731 3: move backwards D(V,C) bytes in the output stream 732 (if this position is before the start of the output 733 stream, then assume that all the data before the 734 start of the output stream is filled with zeros). 735 copy Len+3 bytes from this position to the output stream. 736 let State <- 0. 737 end case 738 end loop 739 740 The functions F,L, and D are dependent on the 'compression 741 factor', 1 through 4, and are defined as follows: 742 743 For compression factor 1: 744 L(X) equals the lower 7 bits of X. 745 F(X) equals 2 if X equals 127 otherwise F(X) equals 3. 746 D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1. 747 For compression factor 2: 748 L(X) equals the lower 6 bits of X. 749 F(X) equals 2 if X equals 63 otherwise F(X) equals 3. 750 D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1. 751 For compression factor 3: 752 L(X) equals the lower 5 bits of X. 753 F(X) equals 2 if X equals 31 otherwise F(X) equals 3. 754 D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1. 755 For compression factor 4: 756 L(X) equals the lower 4 bits of X. 757 F(X) equals 2 if X equals 15 otherwise F(X) equals 3. 758 D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1. 759 760 Imploding - Method 6 761 -------------------- 762 763 The Imploding algorithm is actually a combination of two distinct 764 algorithms. The first algorithm compresses repeated byte 765 sequences using a sliding dictionary. The second algorithm is 766 used to compress the encoding of the sliding dictionary output, 767 using multiple Shannon-Fano trees. 768 769 The Imploding algorithm can use a 4K or 8K sliding dictionary 770 size. The dictionary size used can be determined by bit 1 in the 771 general purpose flag word; a 0 bit indicates a 4K dictionary 772 while a 1 bit indicates an 8K dictionary. 773 774 The Shannon-Fano trees are stored at the start of the compressed 775 file. The number of trees stored is defined by bit 2 in the 776 general purpose flag word; a 0 bit indicates two trees stored, a 777 1 bit indicates three trees are stored. If 3 trees are stored, 778 the first Shannon-Fano tree represents the encoding of the 779 Literal characters, the second tree represents the encoding of 780 the Length information, the third represents the encoding of the 781 Distance information. When 2 Shannon-Fano trees are stored, the 782 Length tree is stored first, followed by the Distance tree. 783 784 The Literal Shannon-Fano tree, if present is used to represent 785 the entire ASCII character set, and contains 256 values. This 786 tree is used to compress any data not compressed by the sliding 787 dictionary algorithm. When this tree is present, the Minimum 788 Match Length for the sliding dictionary is 3. If this tree is 789 not present, the Minimum Match Length is 2. 790 791 The Length Shannon-Fano tree is used to compress the Length part 792 of the (length,distance) pairs from the sliding dictionary 793 output. The Length tree contains 64 values, ranging from the 794 Minimum Match Length, to 63 plus the Minimum Match Length. 795 796 The Distance Shannon-Fano tree is used to compress the Distance 797 part of the (length,distance) pairs from the sliding dictionary 798 output. The Distance tree contains 64 values, ranging from 0 to 799 63, representing the upper 6 bits of the distance value. The 800 distance values themselves will be between 0 and the sliding 801 dictionary size, either 4K or 8K. 802 803 The Shannon-Fano trees themselves are stored in a compressed 804 format. The first byte of the tree data represents the number of 805 bytes of data representing the (compressed) Shannon-Fano tree 806 minus 1. The remaining bytes represent the Shannon-Fano tree 807 data encoded as: 808 809 High 4 bits: Number of values at this bit length + 1. (1 - 16) 810 Low 4 bits: Bit Length needed to represent value + 1. (1 - 16) 811 812 The Shannon-Fano codes can be constructed from the bit lengths 813 using the following algorithm: 814 815 1) Sort the Bit Lengths in ascending order, while retaining the 816 order of the original lengths stored in the file. 817 818 2) Generate the Shannon-Fano trees: 819 820 Code <- 0 821 CodeIncrement <- 0 822 LastBitLength <- 0 823 i <- number of Shannon-Fano codes - 1 (either 255 or 63) 824 825 loop while i >= 0 826 Code = Code + CodeIncrement 827 if BitLength(i) <> LastBitLength then 828 LastBitLength=BitLength(i) 829 CodeIncrement = 1 shifted left (16 - LastBitLength) 830 ShannonCode(i) = Code 831 i <- i - 1 832 end loop 833 834 3) Reverse the order of all the bits in the above ShannonCode() 835 vector, so that the most significant bit becomes the least 836 significant bit. For example, the value 0x1234 (hex) would 837 become 0x2C48 (hex). 838 839 4) Restore the order of Shannon-Fano codes as originally stored 840 within the file. 841 842 Example: 843 844 This example will show the encoding of a Shannon-Fano tree 845 of size 8. Notice that the actual Shannon-Fano trees used 846 for Imploding are either 64 or 256 entries in size. 847 848 Example: 0x02, 0x42, 0x01, 0x13 849 850 The first byte indicates 3 values in this table. Decoding the 851 bytes: 852 0x42 = 5 codes of 3 bits long 853 0x01 = 1 code of 2 bits long 854 0x13 = 2 codes of 4 bits long 855 856 This would generate the original bit length array of: 857 (3, 3, 3, 3, 3, 2, 4, 4) 858 859 There are 8 codes in this table for the values 0 thru 7. Using 860 the algorithm to obtain the Shannon-Fano codes produces: 861 862 Reversed Order Original 863 Val Sorted Constructed Code Value Restored Length 864 --- ------ ----------------- -------- -------- ------ 865 0: 2 1100000000000000 11 101 3 866 1: 3 1010000000000000 101 001 3 867 2: 3 1000000000000000 001 110 3 868 3: 3 0110000000000000 110 010 3 869 4: 3 0100000000000000 010 100 3 870 5: 3 0010000000000000 100 11 2 871 6: 4 0001000000000000 1000 1000 4 872 7: 4 0000000000000000 0000 0000 4 873 874 The values in the Val, Order Restored and Original Length columns 875 now represent the Shannon-Fano encoding tree that can be used for 876 decoding the Shannon-Fano encoded data. How to parse the 877 variable length Shannon-Fano values from the data stream is beyond 878 the scope of this document. (See the references listed at the end of 879 this document for more information.) However, traditional decoding 880 schemes used for Huffman variable length decoding, such as the 881 Greenlaw algorithm, can be successfully applied. 882 883 The compressed data stream begins immediately after the 884 compressed Shannon-Fano data. The compressed data stream can be 885 interpreted as follows: 886 887 loop until done 888 read 1 bit from input stream. 889 890 if this bit is non-zero then (encoded data is literal data) 891 if Literal Shannon-Fano tree is present 892 read and decode character using Literal Shannon-Fano tree. 893 otherwise 894 read 8 bits from input stream. 895 copy character to the output stream. 896 otherwise (encoded data is sliding dictionary match) 897 if 8K dictionary size 898 read 7 bits for offset Distance (lower 7 bits of offset). 899 otherwise 900 read 6 bits for offset Distance (lower 6 bits of offset). 901 902 using the Distance Shannon-Fano tree, read and decode the 903 upper 6 bits of the Distance value. 904 905 using the Length Shannon-Fano tree, read and decode 906 the Length value. 907 908 Length <- Length + Minimum Match Length 909 910 if Length = 63 + Minimum Match Length 911 read 8 bits from the input stream, 912 add this value to Length. 913 914 move backwards Distance+1 bytes in the output stream, and 915 copy Length characters from this position to the output 916 stream. (if this position is before the start of the output 917 stream, then assume that all the data before the start of 918 the output stream is filled with zeros). 919 end loop 920 921 Tokenizing - Method 7 922 -------------------- 923 924 This method is not used by PKZIP. 925 926 Deflating - Method 8 927 ----------------- 928 929 The Deflate algorithm is similar to the Implode algorithm using 930 a sliding dictionary of up to 32K with secondary compression 931 from Huffman/Shannon-Fano codes. 932 933 The compressed data is stored in blocks with a header describing 934 the block and the Huffman codes used in the data block. The header 935 format is as follows: 936 937 Bit 0: Last Block bit This bit is set to 1 if this is the last 938 compressed block in the data. 939 Bits 1-2: Block type 940 00 (0) - Block is stored - All stored data is byte aligned. 941 Skip bits until next byte, then next word = block 942 length, followed by the ones compliment of the block 943 length word. Remaining data in block is the stored 944 data. 945 946 01 (1) - Use fixed Huffman codes for literal and distance codes. 947 Lit Code Bits Dist Code Bits 948 --------- ---- --------- ---- 949 0 - 143 8 0 - 31 5 950 144 - 255 9 951 256 - 279 7 952 280 - 287 8 953 954 Literal codes 286-287 and distance codes 30-31 are 955 never used but participate in the huffman construction. 956 957 10 (2) - Dynamic Huffman codes. (See expanding Huffman codes) 958 959 11 (3) - Reserved - Flag a "Error in compressed data" if seen. 960 961 Expanding Huffman Codes 962 ----------------------- 963 If the data block is stored with dynamic Huffman codes, the Huffman 964 codes are sent in the following compressed format: 965 966 5 Bits: # of Literal codes sent - 256 (256 - 286) 967 All other codes are never sent. 968 5 Bits: # of Dist codes - 1 (1 - 32) 969 4 Bits: # of Bit Length codes - 3 (3 - 19) 970 971 The Huffman codes are sent as bit lengths and the codes are built as 972 described in the implode algorithm. The bit lengths themselves are 973 compressed with Huffman codes. There are 19 bit length codes: 974 975 0 - 15: Represent bit lengths of 0 - 15 976 16: Copy the previous bit length 3 - 6 times. 977 The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6) 978 Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will 979 expand to 12 bit lengths of 8 (1 + 6 + 5) 980 17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length) 981 18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length) 982 983 The lengths of the bit length codes are sent packed 3 bits per value 984 (0 - 7) in the following order: 985 986 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 987 988 The Huffman codes should be built as described in the Implode algorithm 989 except codes are assigned starting at the shortest bit length, i.e. the 990 shortest code should be all 0's rather than all 1's. Also, codes with 991 a bit length of zero do not participate in the tree construction. The 992 codes are then used to decode the bit lengths for the literal and 993 distance tables. 994 995 The bit lengths for the literal tables are sent first with the number 996 of entries sent described by the 5 bits sent earlier. There are up 997 to 286 literal characters; the first 256 represent the respective 8 998 bit character, code 256 represents the End-Of-Block code, the remaining 999 29 codes represent copy lengths of 3 thru 258. There are up to 30 1000 distance codes representing distances from 1 thru 32k as described 1001 below. 1002 1003 Length Codes 1004 ------------ 1005 Extra Extra Extra Extra 1006 Code Bits Length Code Bits Lengths Code Bits Lengths Code Bits Length(s) 1007 ---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- --------- 1008 257 0 3 265 1 11,12 273 3 35-42 281 5 131-162 1009 258 0 4 266 1 13,14 274 3 43-50 282 5 163-194 1010 259 0 5 267 1 15,16 275 3 51-58 283 5 195-226 1011 260 0 6 268 1 17,18 276 3 59-66 284 5 227-257 1012 261 0 7 269 2 19-22 277 4 67-82 285 0 258 1013 262 0 8 270 2 23-26 278 4 83-98 1014 263 0 9 271 2 27-30 279 4 99-114 1015 264 0 10 272 2 31-34 280 4 115-130 1016 1017 Distance Codes 1018 -------------- 1019 Extra Extra Extra Extra 1020 Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance 1021 ---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- -------- 1022 0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144 1023 1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192 1024 2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288 1025 3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384 1026 4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576 1027 5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768 1028 6 2 9-12 14 6 129-192 22 10 2049-3072 1029 7 2 13-16 15 6 193-256 23 10 3073-4096 1030 1031 The compressed data stream begins immediately after the 1032 compressed header data. The compressed data stream can be 1033 interpreted as follows: 1034 1035 do 1036 read header from input stream. 1037 1038 if stored block 1039 skip bits until byte aligned 1040 read count and 1's compliment of count 1041 copy count bytes data block 1042 otherwise 1043 loop until end of block code sent 1044 decode literal character from input stream 1045 if literal < 256 1046 copy character to the output stream 1047 otherwise 1048 if literal = end of block 1049 break from loop 1050 otherwise 1051 decode distance from input stream 1052 1053 move backwards distance bytes in the output stream, and 1054 copy length characters from this position to the output 1055 stream. 1056 end loop 1057 while not last block 1058 1059 if data descriptor exists 1060 skip bits until byte aligned 1061 read crc and sizes 1062 endif 1063 1064 Decryption 1065 ---------- 1066 1067 The encryption used in PKZIP was generously supplied by Roger 1068 Schlafly. PKWARE is grateful to Mr. Schlafly for his expert 1069 help and advice in the field of data encryption. 1070 1071 PKZIP encrypts the compressed data stream. Encrypted files must 1072 be decrypted before they can be extracted. 1073 1074 Each encrypted file has an extra 12 bytes stored at the start of 1075 the data area defining the encryption header for that file. The 1076 encryption header is originally set to random values, and then 1077 itself encrypted, using three, 32-bit keys. The key values are 1078 initialized using the supplied encryption password. After each byte 1079 is encrypted, the keys are then updated using pseudo-random number 1080 generation techniques in combination with the same CRC-32 algorithm 1081 used in PKZIP and described elsewhere in this document. 1082 1083 The following is the basic steps required to decrypt a file: 1084 1085 1) Initialize the three 32-bit keys with the password. 1086 2) Read and decrypt the 12-byte encryption header, further 1087 initializing the encryption keys. 1088 3) Read and decrypt the compressed data stream using the 1089 encryption keys. 1090 1091 Step 1 - Initializing the encryption keys 1092 ----------------------------------------- 1093 1094 Key(0) <- 305419896 1095 Key(1) <- 591751049 1096 Key(2) <- 878082192 1097 1098 loop for i <- 0 to length(password)-1 1099 update_keys(password(i)) 1100 end loop 1101 1102 Where update_keys() is defined as: 1103 1104 update_keys(char): 1105 Key(0) <- crc32(key(0),char) 1106 Key(1) <- Key(1) + (Key(0) & 000000ffH) 1107 Key(1) <- Key(1) * 134775813 + 1 1108 Key(2) <- crc32(key(2),key(1) >> 24) 1109 end update_keys 1110 1111 Where crc32(old_crc,char) is a routine that given a CRC value and a 1112 character, returns an updated CRC value after applying the CRC-32 1113 algorithm described elsewhere in this document. 1114 1115 Step 2 - Decrypting the encryption header 1116 ----------------------------------------- 1117 1118 The purpose of this step is to further initialize the encryption 1119 keys, based on random data, to render a plaintext attack on the 1120 data ineffective. 1121 1122 Read the 12-byte encryption header into Buffer, in locations 1123 Buffer(0) thru Buffer(11). 1124 1125 loop for i <- 0 to 11 1126 C <- buffer(i) ^ decrypt_byte() 1127 update_keys(C) 1128 buffer(i) <- C 1129 end loop 1130 1131 Where decrypt_byte() is defined as: 1132 1133 unsigned char decrypt_byte() 1134 local unsigned short temp 1135 temp <- Key(2) | 2 1136 decrypt_byte <- (temp * (temp ^ 1)) >> 8 1137 end decrypt_byte 1138 1139 After the header is decrypted, the last 1 or 2 bytes in Buffer 1140 should be the high-order word/byte of the CRC for the file being 1141 decrypted, stored in Intel low-byte/high-byte order. Versions of 1142 PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is 1143 used on versions after 2.0. This can be used to test if the password 1144 supplied is correct or not. 1145 1146 Step 3 - Decrypting the compressed data stream 1147 ---------------------------------------------- 1148 1149 The compressed data stream can be decrypted as follows: 1150 1151 loop until done 1152 read a character into C 1153 Temp <- C ^ decrypt_byte() 1154 update_keys(temp) 1155 output Temp 1156 end loop 1157 1158 In addition to the above mentioned contributors to PKZIP and PKUNZIP, 1159 I would like to extend special thanks to Robert Mahoney for suggesting 1160 the extension .ZIP for this software. 1161 1162 References: 1163 1164 Fiala, Edward R., and Greene, Daniel H., "Data compression with 1165 finite windows", Communications of the ACM, Volume 32, Number 4, 1166 April 1989, pages 490-505. 1167 1168 Held, Gilbert, "Data Compression, Techniques and Applications, 1169 Hardware and Software Considerations", John Wiley & Sons, 1987. 1170 1171 Huffman, D.A., "A method for the construction of minimum-redundancy 1172 codes", Proceedings of the IRE, Volume 40, Number 9, September 1952, 1173 pages 1098-1101. 1174 1175 Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14, 1176 Number 10, October 1989, pages 29-37. 1177 1178 Nelson, Mark, "The Data Compression Book", M&T Books, 1991. 1179 1180 Storer, James A., "Data Compression, Methods and Theory", 1181 Computer Science Press, 1988 1182 1183 Welch, Terry, "A Technique for High-Performance Data Compression", 1184 IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19. 1185 1186 Ziv, J. and Lempel, A., "A universal algorithm for sequential data 1187 compression", Communications of the ACM, Volume 30, Number 6, 1188 June 1987, pages 520-540. 1189 1190 Ziv, J. and Lempel, A., "Compression of individual sequences via 1191 variable-rate coding", IEEE Transactions on Information Theory, 1192 Volume 24, Number 5, September 1978, pages 530-536.