framing_format.txt (5039B)
Snappy framing format description
Last revised: 2013-10-25

This document describes a framing format for Snappy, allowing compressing to
files or streams that can then more easily be decompressed without having
to hold the entire stream in memory. It also provides data checksums to
help verify integrity. It does not provide metadata checksums, so it does
not protect against, e.g., all forms of truncation.

Implementation of the framing format is optional for Snappy compressors and
decompressors; it is not part of the Snappy core specification.


1. General structure

The file consists solely of chunks, lying back-to-back with no padding
in between. Each chunk consists first of a single byte of chunk identifier,
then a three-byte little-endian length of the chunk in bytes (from 0 to
16777215, inclusive), and then the data, if any. The four bytes of chunk
header are not counted in the data length.

The different chunk types are listed below. The first chunk must always
be the stream identifier chunk (see section 4.1, below). The stream
ends when the file ends -- there is no explicit end-of-file marker.


2. File type identification

The following identifiers for this format are recommended where appropriate.
However, note that none have been registered officially, so this is only to
be taken as a guideline. We use "Snappy framed" to distinguish between this
format and raw Snappy data.

  File extension:         .sz
  MIME type:              application/x-snappy-framed
  HTTP Content-Encoding:  x-snappy-framed


3. Checksum format

Some chunks have data protected by a checksum (the ones that do will say so
explicitly). The checksums are always masked CRC-32Cs.

A description of CRC-32C can be found in RFC 3720, section 12.1, with
examples in section B.4.

Checksums are not stored directly, but masked, as checksumming data and
then its own checksum can be problematic. The masking is the same as used
in Apache Hadoop: Rotate the checksum right by 15 bits, then add the constant
0xa282ead8 (using wraparound as normal for unsigned integers). This is
equivalent to the following C code:

  uint32_t mask_checksum(uint32_t x) {
    return ((x >> 15) | (x << 17)) + 0xa282ead8;
  }

Note that the masking is reversible.

The checksum is always stored as a four-byte integer, in little-endian.


4. Chunk types

The currently supported chunk types are described below. The list may
be extended in the future.


4.1. Stream identifier (chunk type 0xff)

The stream identifier is always the first element in the stream.
It is exactly six bytes long and contains "sNaPpY" in ASCII. This means that
a valid Snappy framed stream always starts with the bytes

  0xff 0x06 0x00 0x00 0x73 0x4e 0x61 0x50 0x70 0x59

The stream identifier chunk can come multiple times in the stream besides
the first; if such a chunk shows up, it should simply be ignored, assuming
it has the right length and contents. This allows for easy concatenation of
compressed files without the need for re-framing.


4.2. Compressed data (chunk type 0x00)

Compressed data chunks contain a normal Snappy compressed bitstream;
see the compressed format specification. The compressed data is preceded by
the masked CRC-32C (see section 3) of the _uncompressed_ data.

Note that the data portion of the chunk, i.e., the compressed contents,
can be at most 16777211 bytes (2^24 - 1, minus the four checksum bytes).
However, we place an additional restriction that the uncompressed data
in a chunk must be no longer than 65536 bytes. This allows consumers to
easily use small fixed-size buffers.


4.3. Uncompressed data (chunk type 0x01)

Uncompressed data chunks allow a compressor to send uncompressed,
raw data; this is useful if, for instance, incompressible or
near-incompressible data is detected, and faster decompression is desired.

As in the compressed chunks, the data is preceded by its own masked
CRC-32C (see section 3).

An uncompressed data chunk, like compressed data chunks, should contain
no more than 65536 data bytes, so the maximum legal chunk length with the
checksum is 65540.


4.4. Padding (chunk type 0xfe)

Padding chunks allow a compressor to increase the size of the data stream
so that it complies with external demands, e.g. that the total number of
bytes is a multiple of some value.

All bytes of the padding chunk, except the chunk type byte itself and the
length, should be zero, but decompressors must not try to interpret or
verify the padding data in any way.


4.5. Reserved unskippable chunks (chunk types 0x02-0x7f)

These are reserved for future expansion. A decoder that sees such a chunk
should immediately return an error, as it must assume it cannot decode the
stream correctly.

Future versions of this specification may define meanings for these chunks.


4.6. Reserved skippable chunks (chunk types 0x80-0xfd)

These are also reserved for future expansion, but unlike the chunks
described in 4.5, a decoder seeing these must skip them and continue
decoding.

Future versions of this specification may define meanings for these chunks.
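
As a non-normative illustration of the chunk header layout from section 1,
the checksum masking from section 3 and the chunk-type rules from section 4,
the following C sketch walks a framed stream held entirely in memory. Apart
from mask_checksum, every name in it (crc32c, unmask_checksum, load_le32,
walk_frames, the frame_status values and struct frame_stats) is hypothetical
and chosen only for this example. It assumes a 32-bit unsigned int, uses a
slow bitwise CRC-32C, and does not decompress or verify compressed chunks,
since that would require a Snappy decompressor. Rejecting an empty stream and
an over-long uncompressed chunk are choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
   Short rather than fast; a table or hardware CRC would be usual. */
uint32_t crc32c(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Masking as given in section 3. */
uint32_t mask_checksum(uint32_t x) {
    return ((x >> 15) | (x << 17)) + 0xa282ead8u;
}

/* Inverse of the masking: subtract the constant, rotate left by 15. */
uint32_t unmask_checksum(uint32_t m) {
    uint32_t r = m - 0xa282ead8u;
    return (r << 15) | (r >> 17);
}

/* Read a four-byte little-endian integer. */
uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

enum frame_status {
    FRAME_OK = 0, FRAME_ERR_MISSING_ID, FRAME_ERR_BAD_ID, FRAME_ERR_TRUNCATED,
    FRAME_ERR_TOO_LONG, FRAME_ERR_CHECKSUM, FRAME_ERR_UNSKIPPABLE
};

struct frame_stats {
    size_t compressed_chunks; /* type 0x00 chunks seen (not decoded here) */
    size_t raw_bytes;         /* payload bytes of verified type 0x01 chunks */
};

enum frame_status walk_frames(const uint8_t *buf, size_t len,
                              struct frame_stats *st) {
    static const uint8_t magic[6] = { 's', 'N', 'a', 'P', 'p', 'Y' };
    size_t pos = 0;
    int first = 1;
    assert(st != NULL && (buf != NULL || len == 0));
    st->compressed_chunks = 0;
    st->raw_bytes = 0;
    while (pos < len) {
        if (len - pos < 4) return FRAME_ERR_TRUNCATED;
        uint8_t type = buf[pos];
        /* Three-byte little-endian length; excludes the header itself. */
        size_t n = (size_t)buf[pos + 1] | ((size_t)buf[pos + 2] << 8) |
                   ((size_t)buf[pos + 3] << 16);
        const uint8_t *data = buf + pos + 4;
        if (len - pos - 4 < n) return FRAME_ERR_TRUNCATED;
        if (first && type != 0xff) return FRAME_ERR_MISSING_ID;
        first = 0;
        if (type == 0xff) {            /* stream identifier; repeats allowed */
            if (n != 6 || memcmp(data, magic, 6) != 0) return FRAME_ERR_BAD_ID;
        } else if (type == 0x00) {     /* compressed data: only counted */
            if (n < 4) return FRAME_ERR_TRUNCATED;
            st->compressed_chunks++;
        } else if (type == 0x01) {     /* uncompressed data: verify checksum */
            if (n < 4) return FRAME_ERR_TRUNCATED;
            if (n > 65540) return FRAME_ERR_TOO_LONG;
            if (mask_checksum(crc32c(data + 4, n - 4)) != load_le32(data))
                return FRAME_ERR_CHECKSUM;
            st->raw_bytes += n - 4;
        } else if (type >= 0x02 && type <= 0x7f) {
            return FRAME_ERR_UNSKIPPABLE; /* reserved unskippable */
        }
        /* 0x80-0xfd (reserved skippable) and 0xfe (padding) are skipped. */
        pos += 4 + n;
    }
    /* Assumption of this sketch: an empty stream has no identifier chunk. */
    return first ? FRAME_ERR_MISSING_ID : FRAME_OK;
}
```

A caller would pass the complete stream to walk_frames and treat anything
other than FRAME_OK as a decoding error. A full decoder would additionally
decompress each type 0x00 chunk and apply the same masked CRC-32C comparison
to the decompressed bytes.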