[ tor-browser ].git.dasho

lzma-specification.txt (36761B)
      1 LZMA specification (DRAFT version)
      2 ----------------------------------
      3 
      4 Author: Igor Pavlov
      5 Date: 2015-06-14
      6 
      7 This specification defines the format of LZMA compressed data and lzma file format.
      8 
      9 Notation 
     10 --------
     11 
     12 We use the syntax of C++ programming language.
     13 We use the following types in C++ code:
     14  unsigned - unsigned integer, at least 16 bits in size
     15  int      - signed integer, at least 16 bits in size
     16  UInt64   - 64-bit unsigned integer
     17  UInt32   - 32-bit unsigned integer
     18  UInt16   - 16-bit unsigned integer
     19  Byte     - 8-bit unsigned integer
     20  bool     - boolean type with two possible values: false, true
     21 
     22 
     23 lzma file format
     24 ================
     25 
     26 The lzma file contains the raw LZMA stream and the header with related properties.
     27 
     28 The files in that format use ".lzma" extension.
     29 
     30 The lzma file format layout:
     31 
     32 Offset Size Description
     33 
     34  0     1   LZMA model properties (lc, lp, pb) in encoded form
     35  1     4   Dictionary size (32-bit unsigned integer, little-endian)
     36  5     8   Uncompressed size (64-bit unsigned integer, little-endian)
     37 13         Compressed data (LZMA stream)
     38 
     39 LZMA properties:
     40 
     41    name  Range          Description
     42 
     43      lc  [0, 8]         the number of "literal context" bits
     44      lp  [0, 4]         the number of "literal pos" bits
     45      pb  [0, 4]         the number of "pos" bits
     46 dictSize  [0, 2^32 - 1]  the dictionary size 
     47 
     48 The following code encodes LZMA properties:
     49 
     50 void EncodeProperties(Byte *properties)
     51 {
     52  properties[0] = (Byte)((pb * 5 + lp) * 9 + lc);
     53  Set_UInt32_LittleEndian(properties + 1, dictSize);
     54 }
     55 
     56 If the value of dictionary size in properties is smaller than (1 << 12),
     57 the LZMA decoder must set the dictionary size variable to (1 << 12).
     58 
     59 #define LZMA_DIC_MIN (1 << 12)
     60 
     61  unsigned lc, pb, lp;
     62  UInt32 dictSize;
     63  UInt32 dictSizeInProperties;
     64 
     65  void DecodeProperties(const Byte *properties)
     66  {
     67    unsigned d = properties[0];
     68    if (d >= (9 * 5 * 5))
     69      throw "Incorrect LZMA properties";
     70    lc = d % 9;
     71    d /= 9;
     72    pb = d / 5;
     73    lp = d % 5;
     74    dictSizeInProperties = 0;
     75    for (int i = 0; i < 4; i++)
     76      dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i);
     77    dictSize = dictSizeInProperties;
     78    if (dictSize < LZMA_DIC_MIN)
     79      dictSize = LZMA_DIC_MIN;
     80  }
     81 
     82 If "Uncompressed size" field contains ones in all 64 bits, it means that
     83 uncompressed size is unknown and there is the "end marker" in stream,
     84 that indicates the end of decoding point.
     85 In opposite case, if the value from "Uncompressed size" field is not
     86 equal to ((2^64) - 1), the LZMA stream decoding must be finished after
     87 specified number of bytes (Uncompressed size) is decoded. And if there 
     88 is the "end marker", the LZMA decoder must read that marker also.
     89 
     90 
     91 The new scheme to encode LZMA properties
     92 ----------------------------------------
     93 
     94 If LZMA compression is used for some another format, it's recommended to
     95 use a new improved scheme to encode LZMA properties. That new scheme was
     96 used in xz format that uses the LZMA2 compression algorithm.
     97 The LZMA2 is a new compression algorithm that is based on the LZMA algorithm.
     98 
     99 The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports
    100 only reduced set of dictionary sizes:
    101  (2 << 11), (3 << 11),
    102  (2 << 12), (3 << 12),
    103  ...
    104  (2 << 30), (3 << 30),
    105  (2 << 31) - 1
    106 
    107 The dictionary size can be extracted from encoded value with the following code:
    108 
    109  dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11));
    110 
    111 Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of 
    112 "lc" and "lp" properties:
    113 
    114  if (lc + lp > 4)
    115    throw "Unsupported properties: (lc + lp) > 4";
    116 
    117 There are some advantages for LZMA decoder with such (lc + lp) value
    118 limitation. It reduces the maximum size of tables allocated by decoder.
    119 And it reduces the complexity of initialization procedure, that can be 
    120 important to keep high speed of decoding of big number of small LZMA streams.
    121 
    122 It's recommended to use that limitation (lc + lp <= 4) for any new format
    123 that uses LZMA compression. Note that the combinations of "lc" and "lp" 
    124 parameters, where (lc + lp > 4), can provide significant improvement in 
    125 compression ratio only in some rare cases.
    126 
    127 The LZMA properties can be encoded into two bytes in new scheme:
    128 
    129 Offset Size Description
    130 
    131  0     1   The dictionary size encoded with LZMA2 scheme
    132  1     1   LZMA model properties (lc, lp, pb) in encoded form
    133 
    134 
    135 The RAM usage 
    136 =============
    137 
    138 The RAM usage for LZMA decoder is determined by the following parts:
    139 
    140 1) The Sliding Window (from 4 KiB to 4 GiB).
    141 2) The probability model counter arrays (arrays of 16-bit variables).
    142 3) Some additional state variables (about 10 variables of 32-bit integers).
    143 
    144 
    145 The RAM usage for Sliding Window
    146 --------------------------------
    147 
    148 There are two main scenarios of decoding:
    149 
    150 1) The decoding of full stream to one RAM buffer.
    151 
    152  If we decode full LZMA stream to one output buffer in RAM, the decoder 
    153  can use that output buffer as sliding window. So the decoder doesn't 
    154  need additional buffer allocated for sliding window.
    155 
    156 2) The decoding to some external storage.
    157 
    158  If we decode LZMA stream to external storage, the decoder must allocate
    159  the buffer for sliding window. The size of that buffer must be equal 
    160  or larger than the value of dictionary size from properties of LZMA stream.
    161 
    162 In this specification we describe the code for decoding to some external
    163 storage. The optimized version of code for decoding of full stream to one
    164 output RAM buffer can require some minor changes in code.
    165 
    166 
    167 The RAM usage for the probability model counters
    168 ------------------------------------------------
    169 
    170 The size of the probability model counter arrays is calculated with the 
    171 following formula:
    172 
    173 size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc))
    174 
    175 Each probability model counter is 11-bit unsigned integer.
    176 If we use 16-bit integer variables (2-byte integers) for these probability 
    177 model counters, the RAM usage required by probability model counter arrays 
    178 can be estimated with the following formula:
    179 
    180  RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc))
    181 
    182 For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is
    183 
    184  RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB
    185 
    186 The maximum RAM state usage is required for decoding the stream with lp = 4 
    187 and lc = 8:
    188 
    189  RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB
    190 
    191 If the decoder uses LZMA2's limited property condition 
    192 (lc + lp <= 4), the RAM usage will be not larger than
    193 
    194  RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB
    195 
    196 
    197 The RAM usage for encoder
    198 -------------------------
    199 
    200 There are many variants for LZMA encoding code.
    201 These variants have different values for memory consumption.
    202 Note that memory consumption for LZMA Encoder can not be 
    203 smaller than memory consumption of LZMA Decoder for same stream.
    204 
    205 The RAM usage required by modern effective implementation of 
    206 LZMA Encoder can be estimated with the following formula:
    207 
    208  Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize.
    209 
    210 But there are some modes of the encoder that require less memory.
    211 
    212 
    213 LZMA Decoding
    214 =============
    215 
    216 The LZMA compression algorithm uses LZ-based compression with Sliding Window
    217 and Range Encoding as entropy coding method.
    218 
    219 
    220 Sliding Window
    221 --------------
    222 
    223 LZMA uses Sliding Window compression similar to LZ77 algorithm.
    224 
    225 LZMA stream must be decoded to the sequence that consists
    226 of MATCHES and LITERALS:
    227  
    228  - a LITERAL is a 8-bit character (one byte).
    229    The decoder just puts that LITERAL to the uncompressed stream.
    230  
    231  - a MATCH is a pair of two numbers (DISTANCE-LENGTH pair).
    232    The decoder takes one byte exactly "DISTANCE" characters behind
    233    current position in the uncompressed stream and puts it to 
    234    uncompressed stream. The decoder must repeat it "LENGTH" times.
    235 
    236 The "DISTANCE" can not be larger than dictionary size.
    237 And the "DISTANCE" can not be larger than the number of bytes in
    238 the uncompressed stream that were decoded before that match.
    239 
    240 In this specification we use cyclic buffer to implement Sliding Window
    241 for LZMA decoder:
    242 
    243 class COutWindow
    244 {
    245  Byte *Buf;
    246  UInt32 Pos;
    247  UInt32 Size;
    248  bool IsFull;
    249 
    250 public:
    251  unsigned TotalPos;
    252  COutStream OutStream;
    253 
    254  COutWindow(): Buf(NULL) {}
    255  ~COutWindow() { delete []Buf; }
    256 
    257  void Create(UInt32 dictSize)
    258  {
    259    Buf = new Byte[dictSize];
    260    Pos = 0;
    261    Size = dictSize;
    262    IsFull = false;
    263    TotalPos = 0;
    264  }
    265 
    266  void PutByte(Byte b)
    267  {
    268    TotalPos++;
    269    Buf[Pos++] = b;
    270    if (Pos == Size)
    271    {
    272      Pos = 0;
    273      IsFull = true;
    274    }
    275    OutStream.WriteByte(b);
    276  }
    277 
    278  Byte GetByte(UInt32 dist) const
    279  {
    280    return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos];
    281  }
    282 
    283  void CopyMatch(UInt32 dist, unsigned len)
    284  {
    285    for (; len > 0; len--)
    286      PutByte(GetByte(dist));
    287  }
    288 
    289  bool CheckDistance(UInt32 dist) const
    290  {
    291    return dist <= Pos || IsFull;
    292  }
    293 
    294  bool IsEmpty() const
    295  {
    296    return Pos == 0 && !IsFull;
    297  }
    298 };
    299 
    300 
    301 In another implementation it's possible to use one buffer that contains 
    302 Sliding Window and the whole data stream after uncompressing.
    303 
    304 
    305 Range Decoder
    306 -------------
    307 
    308 LZMA algorithm uses Range Encoding (1) as entropy coding method.
    309 
    310 LZMA stream contains just one very big number in big-endian encoding.
    311 LZMA decoder uses the Range Decoder to extract a sequence of binary
    312 symbols from that big number.
    313 
    314 The state of the Range Decoder:
    315 
    316 struct CRangeDecoder
    317 {
    318  UInt32 Range; 
    319  UInt32 Code;
    320  InputStream *InStream;
    321 
    322  bool Corrupted;
    323 }
    324 
    325 The notes about UInt32 type for the "Range" and "Code" variables:
    326 
    327  It's possible to use 64-bit (unsigned or signed) integer type
    328  for the "Range" and the "Code" variables instead of 32-bit unsigned,
    329  but some additional code must be used to truncate the values to 
    330  low 32-bits after some operations.
    331 
    332  If the programming language does not support 32-bit unsigned integer type 
    333  (like in case of JAVA language), it's possible to use 32-bit signed integer, 
    334  but some code must be changed. For example, it's required to change the code
    335  that uses comparison operations for UInt32 variables in this specification.
    336 
    337 The Range Decoder can be in some states that can be treated as 
    338 "Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted":
    339 
    340  (Corrupted == false), if the Range Decoder has not detected any corruption.
    341  (Corrupted == true), if the Range Decoder has detected some corruption.
    342 
    343 The reference LZMA Decoder ignores the value of the "Corrupted" variable.
    344 So it continues to decode the stream, even if the corruption can be detected
    345 in the Range Decoder. To provide the full compatibility with output of the 
    346 reference LZMA Decoder, another LZMA Decoder implementations must also 
    347 ignore the value of the "Corrupted" variable.
    348 
    349 The LZMA Encoder is required to create only such LZMA streams, that will not 
    350 lead the Range Decoder to states, where the "Corrupted" variable is set to true.
    351 
    352 The Range Decoder reads first 5 bytes from input stream to initialize
    353 the state:
    354 
    355 bool CRangeDecoder::Init()
    356 {
    357  Corrupted = false;
    358  Range = 0xFFFFFFFF;
    359  Code = 0;
    360 
    361  Byte b = InStream->ReadByte();
    362  
    363  for (int i = 0; i < 4; i++)
    364    Code = (Code << 8) | InStream->ReadByte();
    365  
    366  if (b != 0 || Code == Range)
    367    Corrupted = true;
    368  return b == 0;
    369 }
    370 
    371 The LZMA Encoder always writes ZERO in initial byte of compressed stream.
    372 That scheme allows to simplify the code of the Range Encoder in the 
    373 LZMA Encoder. If initial byte is not equal to ZERO, the LZMA Decoder must
    374 stop decoding and report error.
    375 
    376 After the last bit of data was decoded by Range Decoder, the value of the
    377 "Code" variable must be equal to 0. The LZMA Decoder must check it by 
    378 calling the IsFinishedOK() function:
    379 
    380  bool IsFinishedOK() const { return Code == 0; }
    381 
    382 If there is corruption in data stream, there is big probability that
    383 the "Code" value will be not equal to 0 in the Finish() function. So that
    384 check in the IsFinishedOK() function provides very good feature for 
    385 corruption detection.
    386 
    387 The value of the "Range" variable before each bit decoding can not be smaller 
    388 than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in 
    389 described range.
    390 
    391 #define kTopValue ((UInt32)1 << 24)
    392 
    393 void CRangeDecoder::Normalize()
    394 {
    395  if (Range < kTopValue)
    396  {
    397    Range <<= 8;
    398    Code = (Code << 8) | InStream->ReadByte();
    399  }
    400 }
    401 
    402 Notes: if the size of the "Code" variable is larger than 32 bits, it's
    403 required to keep only low 32 bits of the "Code" variable after the change
    404 in Normalize() function.
    405 
    406 If the LZMA Stream is not corrupted, the value of the "Code" variable is
    407 always smaller than value of the "Range" variable.
    408 But the Range Decoder ignores some types of corruptions, so the value of
    409 the "Code" variable can be equal or larger than value of the "Range" variable
    410 for some "Corrupted" archives.
    411 
    412 
    413 LZMA uses Range Encoding only with binary symbols of two types:
    414  1) binary symbols with fixed and equal probabilities (direct bits)
    415  2) binary symbols with predicted probabilities
    416 
    417 The DecodeDirectBits() function decodes the sequence of direct bits:
    418 
    419 UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits)
    420 {
    421  UInt32 res = 0;
    422  do
    423  {
    424    Range >>= 1;
    425    Code -= Range;
    426    UInt32 t = 0 - ((UInt32)Code >> 31);
    427    Code += Range & t;
    428    
    429    if (Code == Range)
    430      Corrupted = true;
    431    
    432    Normalize();
    433    res <<= 1;
    434    res += t + 1;
    435  }
    436  while (--numBits);
    437  return res;
    438 }
    439 
    440 
    441 The Bit Decoding with Probability Model
    442 ---------------------------------------
    443 
    444 The task of Bit Probability Model is to estimate probabilities of binary
    445 symbols. And then it provides the Range Decoder with that information.
    446 The better prediction provides better compression ratio.
    447 The Bit Probability Model uses statistical data of previous decoded
    448 symbols.
    449 
    450 That estimated probability is presented as 11-bit unsigned integer value
    451 that represents the probability of symbol "0".
    452 
    453 #define kNumBitModelTotalBits 11
    454 
    455 Mathematical probabilities can be presented with the following formulas:
    456     probability(symbol_0) = prob / 2048.
    457     probability(symbol_1) =  1 - Probability(symbol_0) =  
    458                           =  1 - prob / 2048 =  
    459                           =  (2048 - prob) / 2048
    460 where the "prob" variable contains 11-bit integer probability counter.
    461 
    462 It's recommended to use 16-bit unsigned integer type, to store these 11-bit
    463 probability values:
    464 
    465 typedef UInt16 CProb;
    466 
    467 Each probability value must be initialized with value ((1 << 11) / 2),
    468 that represents the state, where probabilities of symbols 0 and 1 
    469 are equal to 0.5:
    470 
    471 #define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2)
    472 
    473 The INIT_PROBS macro is used to initialize the array of CProb variables:
    474 
    475 #define INIT_PROBS(p) \
    476 { for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; }
    477 
    478 
    479 The DecodeBit() function decodes one bit.
    480 The LZMA decoder provides the pointer to CProb variable that contains 
    481 information about estimated probability for symbol 0 and the Range Decoder 
    482 updates that CProb variable after decoding. The Range Decoder increases 
    483 estimated probability of the symbol that was decoded:
    484 
    485 #define kNumMoveBits 5
    486 
    487 unsigned CRangeDecoder::DecodeBit(CProb *prob)
    488 {
    489  unsigned v = *prob;
    490  UInt32 bound = (Range >> kNumBitModelTotalBits) * v;
    491  unsigned symbol;
    492  if (Code < bound)
    493  {
    494    v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits;
    495    Range = bound;
    496    symbol = 0;
    497  }
    498  else
    499  {
    500    v -= v >> kNumMoveBits;
    501    Code -= bound;
    502    Range -= bound;
    503    symbol = 1;
    504  }
    505  *prob = (CProb)v;
    506  Normalize();
    507  return symbol;
    508 }
    509 
    510 
    511 The Binary Tree of bit model counters
    512 -------------------------------------
    513 
    514 LZMA uses a tree of Bit model variables to decode symbol that needs
    515 several bits for storing. There are two versions of such trees in LZMA:
    516  1) the tree that decodes bits from high bit to low bit (the normal scheme).
    517  2) the tree that decodes bits from low bit to high bit (the reverse scheme).
    518 
    519 Each binary tree structure supports different size of decoded symbol
    520 (the size of binary sequence that contains value of symbol).
    521 If that size of decoded symbol is "NumBits" bits, the tree structure 
    522 uses the array of (2 << NumBits) counters of CProb type. 
    523 But only ((2 << NumBits) - 1) items are used by encoder and decoder.
    524 The first item (the item with index equal to 0) in array is unused.
    525 That scheme with unused array's item allows to simplify the code.
    526 
    527 unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc)
    528 {
    529  unsigned m = 1;
    530  unsigned symbol = 0;
    531  for (unsigned i = 0; i < numBits; i++)
    532  {
    533    unsigned bit = rc->DecodeBit(&probs[m]);
    534    m <<= 1;
    535    m += bit;
    536    symbol |= (bit << i);
    537  }
    538  return symbol;
    539 }
    540 
    541 template <unsigned NumBits>
    542 class CBitTreeDecoder
    543 {
    544  CProb Probs[(unsigned)1 << NumBits];
    545 
    546 public:
    547 
    548  void Init()
    549  {
    550    INIT_PROBS(Probs);
    551  }
    552 
    553  unsigned Decode(CRangeDecoder *rc)
    554  {
    555    unsigned m = 1;
    556    for (unsigned i = 0; i < NumBits; i++)
    557      m = (m << 1) + rc->DecodeBit(&Probs[m]);
    558    return m - ((unsigned)1 << NumBits);
    559  }
    560 
    561  unsigned ReverseDecode(CRangeDecoder *rc)
    562  {
    563    return BitTreeReverseDecode(Probs, NumBits, rc);
    564  }
    565 };
    566 
    567 
    568 LZ part of LZMA 
    569 ---------------
    570 
    571 LZ part of LZMA describes details about the decoding of MATCHES and LITERALS.
    572 
    573 
    574 The Literal Decoding
    575 --------------------
    576 
    577 The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where 
    578 each table contains 0x300 CProb values:
    579 
    580  CProb *LitProbs;
    581 
    582  void CreateLiterals()
    583  {
    584    LitProbs = new CProb[(UInt32)0x300 << (lc + lp)];
    585  }
    586  
    587  void InitLiterals()
    588  {
    589    UInt32 num = (UInt32)0x300 << (lc + lp);
    590    for (UInt32 i = 0; i < num; i++)
    591      LitProbs[i] = PROB_INIT_VAL;
    592  }
    593 
    594 To select the table for decoding it uses the context that consists of
    595 (lc) high bits from previous literal and (lp) low bits from value that
    596 represents current position in outputStream.
    597 
    598 If (State > 7), the Literal Decoder also uses "matchByte" that represents 
    599 the byte in OutputStream at position the is the DISTANCE bytes before 
    600 current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair
    601 of latest decoded match.
    602 
    603 The following code decodes one literal and puts it to Sliding Window buffer:
    604 
    605  void DecodeLiteral(unsigned state, UInt32 rep0)
    606  {
    607    unsigned prevByte = 0;
    608    if (!OutWindow.IsEmpty())
    609      prevByte = OutWindow.GetByte(1);
    610    
    611    unsigned symbol = 1;
    612    unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc));
    613    CProb *probs = &LitProbs[(UInt32)0x300 * litState];
    614    
    615    if (state >= 7)
    616    {
    617      unsigned matchByte = OutWindow.GetByte(rep0 + 1);
    618      do
    619      {
    620        unsigned matchBit = (matchByte >> 7) & 1;
    621        matchByte <<= 1;
    622        unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]);
    623        symbol = (symbol << 1) | bit;
    624        if (matchBit != bit)
    625          break;
    626      }
    627      while (symbol < 0x100);
    628    }
    629    while (symbol < 0x100)
    630      symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]);
    631    OutWindow.PutByte((Byte)(symbol - 0x100));
    632  }
    633 
    634 
    635 The match length decoding
    636 -------------------------
    637 
    638 The match length decoder returns normalized (zero-based value) 
    639 length of match. That value can be converted to real length of the match 
    640 with the following code:
    641 
    642 #define kMatchMinLen 2
    643 
    644    matchLen = len + kMatchMinLen;
    645 
    646 The match length decoder can return the values from 0 to 271.
    647 And the corresponded real match length values can be in the range 
    648 from 2 to 273.
    649 
    650 The following scheme is used for the match length encoding:
    651 
    652  Binary encoding    Binary Tree structure    Zero-based match length 
    653  sequence                                    (binary + decimal):
    654 
    655  0 xxx              LowCoder[posState]       xxx
    656  1 0 yyy            MidCoder[posState]       yyy + 8
    657  1 1 zzzzzzzz       HighCoder                zzzzzzzz + 16
    658 
    659 LZMA uses bit model variable "Choice" to decode the first selection bit.
    660 
    661 If the first selection bit is equal to 0, the decoder uses binary tree 
    662  LowCoder[posState] to decode 3-bit zero-based match length (xxx).
    663 
    664 If the first selection bit is equal to 1, the decoder uses bit model 
    665  variable "Choice2" to decode the second selection bit.
    666 
    667  If the second selection bit is equal to 0, the decoder uses binary tree 
    668    MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match
    669    length is equal to (yyy + 8).
    670 
    671  If the second selection bit is equal to 1, the decoder uses binary tree 
    672    HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based 
    673    match length is equal to (zzzzzzzz + 16).
    674 
    675 LZMA uses "posState" value as context to select the binary tree 
    676 from LowCoder and MidCoder binary tree arrays:
    677 
    678    unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
    679 
    680 The full code of the length decoder:
    681 
    682 class CLenDecoder
    683 {
    684  CProb Choice;
    685  CProb Choice2;
    686  CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax];
    687  CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax];
    688  CBitTreeDecoder<8> HighCoder;
    689 
    690 public:
    691 
    692  void Init()
    693  {
    694    Choice = PROB_INIT_VAL;
    695    Choice2 = PROB_INIT_VAL;
    696    HighCoder.Init();
    697    for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++)
    698    {
    699      LowCoder[i].Init();
    700      MidCoder[i].Init();
    701    }
    702  }
    703 
    704  unsigned Decode(CRangeDecoder *rc, unsigned posState)
    705  {
    706    if (rc->DecodeBit(&Choice) == 0)
    707      return LowCoder[posState].Decode(rc);
    708    if (rc->DecodeBit(&Choice2) == 0)
    709      return 8 + MidCoder[posState].Decode(rc);
    710    return 16 + HighCoder.Decode(rc);
    711  }
    712 };
    713 
    714 The LZMA decoder uses two instances of CLenDecoder class.
    715 The first instance is for the matches of "Simple Match" type,
    716 and the second instance is for the matches of "Rep Match" type:
    717 
    718  CLenDecoder LenDecoder;
    719  CLenDecoder RepLenDecoder;
    720 
    721 
    722 The match distance decoding
    723 ---------------------------
    724 
    725 LZMA supports dictionary sizes up to 4 GiB minus 1.
    726 The value of match distance (decoded by distance decoder) can be 
    727 from 1 to 2^32. But the distance value that is equal to 2^32 is used to
    728 indicate the "End of stream" marker. So real largest match distance 
    729 that is used for LZ-window match is (2^32 - 1).
    730 
    731 LZMA uses normalized match length (zero-based length) 
    732 to calculate the context state "lenState" do decode the distance value:
    733 
    734 #define kNumLenToPosStates 4
    735 
    736    unsigned lenState = len;
    737    if (lenState > kNumLenToPosStates - 1)
    738      lenState = kNumLenToPosStates - 1;
    739 
    740 The distance decoder returns the "dist" value that is zero-based value 
    741 of match distance. The real match distance can be calculated with the
    742 following code:
    743  
    744  matchDistance = dist + 1; 
    745 
    746 The state of the distance decoder and the initialization code: 
    747 
    748  #define kEndPosModelIndex 14
    749  #define kNumFullDistances (1 << (kEndPosModelIndex >> 1))
    750  #define kNumAlignBits 4
    751 
    752  CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates];
    753  CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex];
    754  CBitTreeDecoder<kNumAlignBits> AlignDecoder;
    755 
    756  void InitDist()
    757  {
    758    for (unsigned i = 0; i < kNumLenToPosStates; i++)
    759      PosSlotDecoder[i].Init();
    760    AlignDecoder.Init();
    761    INIT_PROBS(PosDecoders);
    762  }
    763 
    764 At first stage the distance decoder decodes 6-bit "posSlot" value with bit
    765 tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different 
    766 "posSlot" values.
    767 
    768    unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
    769 
    770 The encoding scheme for distance value is shown in the following table:
    771 
    772 posSlot (decimal) /
    773      zero-based distance (binary)
    774 0    0
    775 1    1
    776 2    10
    777 3    11
    778 
    779 4    10 x
    780 5    11 x
    781 6    10 xx
    782 7    11 xx
    783 8    10 xxx
    784 9    11 xxx
    785 10    10 xxxx
    786 11    11 xxxx
    787 12    10 xxxxx
    788 13    11 xxxxx
    789 
    790 14    10 yy zzzz
    791 15    11 yy zzzz
    792 16    10 yyy zzzz
    793 17    11 yyy zzzz
    794 ...
    795 62    10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
    796 63    11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
    797 
    798 where 
    799  "x ... x" means the sequence of binary symbols encoded with binary tree and 
    800      "Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13.
    801  "y" means direct bit encoded with range coder.
    802  "zzzz" means the sequence of four binary symbols encoded with binary
    803      tree with "Reverse" scheme, where one common binary tree "AlignDecoder"
    804      is used for all posSlot values.
    805 
    806 If (posSlot < 4), the "dist" value is equal to posSlot value.
    807 
    808 If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of
    809  the high bits of "dist" value and the number of the low bits.
    810 
    811  If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders.
    812    (one separated bit tree decoder per one posSlot value) and "Reverse" scheme.
    813    In this implementation we use one CProb array "PosDecoders" that contains 
    814    all CProb variables for all these bit decoders.
    815  
    816  if (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct 
    817    bits from RangeDecoder and the low 4 bits are decoded with a bit tree 
    818    decoder "AlignDecoder" with "Reverse" scheme.
    819 
    820 The code to decode zero-based match distance:
    821  
    822  unsigned DecodeDistance(unsigned len)
    823  {
    824    unsigned lenState = len;
    825    if (lenState > kNumLenToPosStates - 1)
    826      lenState = kNumLenToPosStates - 1;
    827    
    828    unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
    829    if (posSlot < 4)
    830      return posSlot;
    831    
    832    unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1);
    833    UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits);
    834    if (posSlot < kEndPosModelIndex)
    835      dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec);
    836    else
    837    {
    838      dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits;
    839      dist += AlignDecoder.ReverseDecode(&RangeDec);
    840    }
    841    return dist;
    842  }
    843 
    844 
    845 
    846 LZMA Decoding modes
    847 -------------------
    848 
    849 There are 2 types of LZMA streams:
    850 
    851 1) The stream with "End of stream" marker.
    852 2) The stream without "End of stream" marker.
    853 
    854 And the LZMA Decoder supports 3 modes of decoding:
    855 
    856 1) The unpack size is undefined. The LZMA decoder stops decoding after 
    857   getting "End of stream" marker. 
    858   The input variables for that case:
    859    
    860      markerIsMandatory = true
    861      unpackSizeDefined = false
    862      unpackSize contains any value
    863 
    864 2) The unpack size is defined and LZMA decoder supports both variants, 
    865   where the stream can contain "End of stream" marker or the stream is
    866   finished without "End of stream" marker. The LZMA decoder must detect 
    867   any of these situations.
    868   The input variables for that case:
    869    
    870      markerIsMandatory = false
    871      unpackSizeDefined = true
    872      unpackSize contains unpack size
    873 
    874 3) The unpack size is defined and the LZMA stream must contain 
    875   "End of stream" marker
    876   The input variables for that case:
    877    
    878      markerIsMandatory = true
    879      unpackSizeDefined = true
    880      unpackSize contains unpack size
    881 
    882 
    883 The main loop of decoder
    884 ------------------------
    885 
    886 The main loop of LZMA decoder:
    887 
    888 Initialize the LZMA state.
    889 loop
    890 {
    891  // begin of loop
    892  Check "end of stream" conditions.
    893  Decode Type of MATCH / LITERAL. 
    894    If it's LITERAL, decode LITERAL value and put the LITERAL to Window.
    895    If it's MATCH, decode the length of match and the match distance. 
    896        Check error conditions, check end of stream conditions and copy
    897        the sequence of match bytes from sliding window to current position
    898        in window.
    899  Go to begin of loop
    900 }
    901 
    902 The reference implementation of LZMA decoder uses "unpackSize" variable
    903 to keep the number of remaining bytes in output stream. So it reduces 
    904 "unpackSize" value after each decoded LITERAL or MATCH.
    905 
    906 The following code contains the "end of stream" condition check at the start
    907 of the loop:
    908 
    909    if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory)
    910      if (RangeDec.IsFinishedOK())
    911        return LZMA_RES_FINISHED_WITHOUT_MARKER;
    912 
    913 LZMA uses three types of matches:
    914 
    915 1) "Simple Match" -     the match with distance value encoded with bit models.
    916 
    917 2) "Rep Match" -        the match that uses the distance from distance
    918                        history table.
    919 
    920 3) "Short Rep Match" -  the match of single byte length, that uses the latest 
    921                        distance from distance history table.
    922 
    923 The LZMA decoder keeps the history of latest 4 match distances that were used 
    924 by decoder. That set of 4 variables contains zero-based match distances and 
    925 these variables are initialized with zero values:
    926 
    927  UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0;
    928 
    929 The LZMA decoder uses binary model variables to select type of MATCH or LITERAL:
    930 
    931 #define kNumStates 12
    932 #define kNumPosBitsMax 4
    933 
    934  CProb IsMatch[kNumStates << kNumPosBitsMax];
    935  CProb IsRep[kNumStates];
    936  CProb IsRepG0[kNumStates];
    937  CProb IsRepG1[kNumStates];
    938  CProb IsRepG2[kNumStates];
    939  CProb IsRep0Long[kNumStates << kNumPosBitsMax];
    940 
    941 The decoder uses "state" variable value to select exact variable 
    942 from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays.
    943 The "state" variable can get the value from 0 to 11.
    944 Initial value for "state" variable is zero:
    945 
    946  unsigned state = 0;
    947 
    948 The "state" variable is updated after each LITERAL or MATCH with one of the
    949 following functions:
    950 
    951 unsigned UpdateState_Literal(unsigned state)
    952 {
    953  if (state < 4) return 0;
    954  else if (state < 10) return state - 3;
    955  else return state - 6;
    956 }
    957 unsigned UpdateState_Match   (unsigned state) { return state < 7 ? 7 : 10; }
    958 unsigned UpdateState_Rep     (unsigned state) { return state < 7 ? 8 : 11; }
    959 unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }
    960 
    961 The decoder calculates "state2" variable value to select exact variable from 
    962 "IsMatch" and "IsRep0Long" arrays:
    963 
    964 unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
    965 unsigned state2 = (state << kNumPosBitsMax) + posState;
    966 
    967 The decoder uses the following code flow scheme to select exact 
    968 type of LITERAL or MATCH:
    969 
    970 IsMatch[state2] decode
    971  0 - the Literal
    972  1 - the Match
    973    IsRep[state] decode
    974      0 - Simple Match
    975      1 - Rep Match
    976        IsRepG0[state] decode
    977          0 - the distance is rep0
    978            IsRep0Long[state2] decode
    979              0 - Short Rep Match
    980              1 - Rep Match 0
    981          1 - 
    982            IsRepG1[state] decode
    983              0 - Rep Match 1
    984              1 - 
    985                IsRepG2[state] decode
    986                  0 - Rep Match 2
    987                  1 - Rep Match 3
    988 
    989 
    990 LITERAL symbol
    991 --------------
    992 If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type.
    993 
    994 At first the LZMA decoder must check that it doesn't exceed 
    995 specified uncompressed size:
    996 
    997      if (unpackSizeDefined && unpackSize == 0)
    998        return LZMA_RES_ERROR;
    999 
   1000 Then it decodes literal value and puts it to sliding window:
   1001 
   1002      DecodeLiteral(state, rep0);
   1003 
   1004 Then the decoder must update the "state" value and "unpackSize" value;
   1005 
   1006      state = UpdateState_Literal(state);
   1007      unpackSize--;
   1008 
   1009 Then the decoder must go to the begin of main loop to decode next Match or Literal.
   1010 
   1011 
   1012 Simple Match
   1013 ------------
   1014 
   1015 If the value "1" was decoded with IsMatch[state2] decoding,
   1016 we have the "Simple Match" type.
   1017 
   1018 The distance history table is updated with the following scheme:
   1019    
   1020      rep3 = rep2;
   1021      rep2 = rep1;
   1022      rep1 = rep0;
   1023 
   1024 The zero-based length is decoded with "LenDecoder":
   1025 
   1026      len = LenDecoder.Decode(&RangeDec, posState);
   1027 
   1028 The state is update with UpdateState_Match function:
   1029 
   1030      state = UpdateState_Match(state);
   1031 
   1032 and the new "rep0" value is decoded with DecodeDistance:
   1033 
   1034      rep0 = DecodeDistance(len);
   1035 
   1036 That "rep0" will be used as zero-based distance for current match.
   1037 
   1038 If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have 
   1039 "End of stream" marker, so we can stop decoding and check finishing 
   1040 condition in Range Decoder:
   1041 
   1042      if (rep0 == 0xFFFFFFFF)
   1043        return RangeDec.IsFinishedOK() ?
   1044            LZMA_RES_FINISHED_WITH_MARKER :
   1045            LZMA_RES_ERROR;
   1046 
   1047 If uncompressed size is defined, LZMA decoder must check that it doesn't 
   1048 exceed that specified uncompressed size:
   1049 
   1050      if (unpackSizeDefined && unpackSize == 0)
   1051        return LZMA_RES_ERROR;
   1052 
   1053 Also the decoder must check that "rep0" value is not larger than dictionary size
   1054 and is not larger than the number of already decoded bytes:
   1055 
   1056      if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0))
   1057        return LZMA_RES_ERROR;
   1058 
   1059 Then the decoder must copy match bytes as described in 
   1060 "The match symbols copying" section.
   1061 
   1062 
   1063 Rep Match
   1064 ---------
   1065 
   1066 If the LZMA decoder has decoded the value "1" with IsRep[state] variable,
   1067 we have "Rep Match" type.
   1068 
   1069 At first the LZMA decoder must check that it doesn't exceed 
   1070 specified uncompressed size:
   1071 
   1072      if (unpackSizeDefined && unpackSize == 0)
   1073        return LZMA_RES_ERROR;
   1074 
   1075 Also the decoder must return error, if the LZ window is empty:
   1076 
   1077      if (OutWindow.IsEmpty())
   1078        return LZMA_RES_ERROR;
   1079 
   1080 If the match type is "Rep Match", the decoder uses one of the 4 variables of
   1081 distance history table to get the value of distance for current match.
   1082 And there are 4 corresponding ways of decoding flow. 
   1083 
   1084 The decoder updates the distance history with the following scheme 
   1085 depending from type of match:
   1086 
   1087 - "Rep Match 0" or "Short Rep Match":
   1088      ; LZMA doesn't update the distance history    
   1089 
   1090 - "Rep Match 1":
   1091      UInt32 dist = rep1;
   1092      rep1 = rep0;
   1093      rep0 = dist;
   1094 
   1095 - "Rep Match 2":
   1096      UInt32 dist = rep2;
   1097      rep2 = rep1;
   1098      rep1 = rep0;
   1099      rep0 = dist;
   1100 
   1101 - "Rep Match 3":
   1102      UInt32 dist = rep3;
   1103      rep3 = rep2;
   1104      rep2 = rep1;
   1105      rep1 = rep0;
   1106      rep0 = dist;
   1107 
   1108 Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long",
   1109 "IsRepG1", "IsRepG2".
   1110 
   1111 If the subtype is "Short Rep Match", the decoder updates the state, puts 
   1112 the one byte from window to current position in window and goes to next 
   1113 MATCH/LITERAL symbol (the begin of main loop):
   1114 
   1115          state = UpdateState_ShortRep(state);
   1116          OutWindow.PutByte(OutWindow.GetByte(rep0 + 1));
   1117          unpackSize--;
   1118          continue;
   1119 
   1120 In other cases (Rep Match 0/1/2/3), it decodes the zero-based 
   1121 length of match with "RepLenDecoder" decoder:
   1122 
   1123      len = RepLenDecoder.Decode(&RangeDec, posState);
   1124 
   1125 Then it updates the state:
   1126 
   1127      state = UpdateState_Rep(state);
   1128 
   1129 Then the decoder must copy match bytes as described in 
   1130 "The Match symbols copying" section.
   1131 
   1132 
   1133 The match symbols copying
   1134 -------------------------
   1135 
   1136 If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must
   1137 copy the sequence of bytes with calculated match distance and match length.
   1138 If uncompressed size is defined, LZMA decoder must check that it doesn't 
   1139 exceed that specified uncompressed size:
   1140 
   1141    len += kMatchMinLen;
   1142    bool isError = false;
   1143    if (unpackSizeDefined && unpackSize < len)
   1144    {
   1145      len = (unsigned)unpackSize;
   1146      isError = true;
   1147    }
   1148    OutWindow.CopyMatch(rep0 + 1, len);
   1149    unpackSize -= len;
   1150    if (isError)
   1151      return LZMA_RES_ERROR;
   1152 
   1153 Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL.
   1154 
   1155 
   1156 
   1157 NOTES
   1158 -----
   1159 
   1160 This specification doesn't describe the variant of decoder implementation 
   1161 that supports partial decoding. Such partial decoding case can require some 
   1162 changes in "end of stream" condition checks code. Also such code 
   1163 can use additional status codes, returned by decoder.
   1164 
   1165 This specification uses C++ code with templates to simplify describing.
   1166 The optimized version of LZMA decoder doesn't need templates.
   1167 Such optimized version can use just two arrays of CProb variables:
   1168  1) The dynamic array of CProb variables allocated for the Literal Decoder.
   1169  2) The one common array that contains all other CProb variables.
   1170 
   1171 
   1172 References:      
   1173 
   1174 1. G. N. N. Martin, Range encoding: an algorithm for removing redundancy 
   1175   from a digitized message, Video & Data Recording Conference, 
   1176   Southampton, UK, July 24-27, 1979.
	tor-browser The Tor Browser
	git clone https://git.dasho.dev/tor-browser.git
	Log \| Files \| Refs \| README \| LICENSE