Coverage report: /home/ellis/comp/core/ffi/zstd/dict.lisp
Kind | Covered | All | % |
expression | 5 | 44 | 11.4 |
branch | 0 | 0 | nil |
;;; dict.lisp --- Zstd Dictionary API
;; The CDict can be created once and shared across multiple threads since it's
;; read-only.
;; Unclear if DDict is also read-only.
* Zstd dictionary builder
* Why should I use a dictionary?
* ------------------------------
* Zstd can use dictionaries to improve compression ratio of small data.
* Traditionally small files don't compress well because there is very little
* repetition in a single sample, since it is small. But, if you are compressing
* many similar files, like a bunch of JSON records that share the same
* structure, you can train a dictionary ahead of time on some samples of
* these files. Then, zstd can use the dictionary to find repetitions that are
* present across samples. This can vastly improve compression ratio.
* When is a dictionary useful?
* ----------------------------
* Dictionaries are useful when compressing many small files that are similar.
* The larger a file is, the less benefit a dictionary will have. Generally,
* we don't expect dictionary compression to be effective past 100KB. And the
* smaller a file is, the more we would expect the dictionary to help.
* How do I use a dictionary?
* --------------------------
* Simply pass the dictionary to the zstd compressor with
* `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
* the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
* more advanced functions that allow selecting some options, see zstd.h for
* complete documentation.
* What is a zstd dictionary?
* --------------------------
* A zstd dictionary has two pieces: Its header, and its content. The header
* contains a magic number, the dictionary ID, and entropy tables. These
* entropy tables allow zstd to save on header costs in the compressed file,
* which really matters for small data. The content is just bytes, which are
* repeated content that is common across many samples.
* What is a raw content dictionary?
* ---------------------------------
* A raw content dictionary is just bytes. It doesn't have a zstd dictionary
* header, a dictionary ID, or entropy tables. Any buffer is a valid raw
* content dictionary.
* How do I train a dictionary?
* ----------------------------
* Gather samples from your use case. These samples should be similar to each
* other. If you have several use cases, you could try to train one dictionary
* per use case.
* Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
* dictionary. There are a few advanced versions of this function, but this
* is a great starting point. If you want to further tune your dictionary
* you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
* you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
* If the dictionary training function fails, that is likely because you
* either passed too few samples, or a dictionary would not be effective
* for your data. Look at the messages that the dictionary trainer printed;
* if it doesn't say too few samples, then a dictionary would not be effective.
* How large should my dictionary be?
* ----------------------------------
* A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
* The zstd CLI defaults to a 110KB dictionary. You likely don't need a
* dictionary larger than that. But, most use cases can get away with a
* smaller dictionary. The advanced dictionary builders can automatically
* shrink the dictionary for you, and select the smallest size that doesn't
* hurt compression ratio too much. See the `shrinkDict` parameter.
* A smaller dictionary can save memory, and potentially speed up
* compression.
* How many samples should I provide to the dictionary builder?
* ------------------------------------------------------------
* We generally recommend passing ~100x the size of the dictionary
* in samples. A few thousand should suffice. Having too few samples
* can hurt the dictionary's effectiveness. Having more samples will
* only improve the dictionary's effectiveness. But having too many
* samples can slow down the dictionary builder.
* How do I determine if a dictionary will be effective?
* -----------------------------------------------------
* Simply train a dictionary and try it out. You can use zstd's built-in
* benchmarking tool to test the dictionary's effectiveness.
*   # Benchmark levels 1-3 without a dictionary
*   zstd -b1e3 -r /path/to/my/files
*   # Benchmark levels 1-3 with a dictionary
*   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
* When should I retrain a dictionary?
* -----------------------------------
* You should retrain a dictionary when its effectiveness drops. Dictionary
* effectiveness drops as the data you are compressing changes. Generally, we do
* expect dictionaries to "decay" over time, as your data changes, but the rate
* at which they decay depends on your use case. Internally, we regularly
* retrain dictionaries, and if the new dictionary performs significantly
* better than the old dictionary, we will ship the new dictionary.
* I have a raw content dictionary, how do I turn it into a zstd dictionary?
* -------------------------------------------------------------------------
* If you have a raw content dictionary, e.g. by manually constructing it, or
* using a third-party dictionary builder, you can turn it into a zstd
* dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
* provide some samples of the data. It will add the zstd header to the
* raw content, which contains a dictionary ID and entropy tables, which
* will improve compression ratio, and allow zstd to write the dictionary ID
* into the frame, if you so choose.
* Do I have to use zstd's dictionary builder?
* -------------------------------------------
* No! You can construct dictionary content however you please, it is just
* bytes. It will always be valid as a raw content dictionary. If you want
* a zstd dictionary, which can improve compression ratio, use
* `ZDICT_finalizeDictionary()`.
* What is the attack surface of a zstd dictionary?
* ------------------------------------------------
* Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
* zstd should never crash or access out-of-bounds memory, no matter what
* the dictionary is. However, if an attacker can control the dictionary
* during decompression, they can cause zstd to generate arbitrary bytes,
* just as if they controlled the compressed data.
******************************************************************************/
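The header layout described above (a magic number, then the dictionary ID, then entropy tables) can be checked from Lisp before a buffer is handed to the FFI. The following is a minimal sketch rather than part of this file: it assumes the layout given in the zstd format specification (a 4-byte little-endian magic number `#xEC30A437` followed by a 4-byte little-endian dictionary ID), and the names `+zstd-dict-magic+`, `read-u32le`, and `dict-header-id` are hypothetical.

```lisp
;; Assumption: magic value taken from the zstd format specification.
(defconstant +zstd-dict-magic+ #xEC30A437)

(defun read-u32le (octets offset)
  "Read an unsigned 32-bit little-endian integer from OCTETS at OFFSET."
  (loop for i from 0 below 4
        sum (ash (aref octets (+ offset i)) (* 8 i))))

(defun dict-header-id (octets)
  "Return the dictionary ID when OCTETS starts with the zstd dictionary
magic number, or NIL when it does not (i.e. it could only be used as a
raw content dictionary)."
  (when (and (>= (length octets) 8)
             (= (read-u32le octets 0) +zstd-dict-magic+))
    (read-u32le octets 4)))
```

A NIL result only says the magic number is absent; the `ZDICT_getDictID` binding later in this file remains the authoritative check.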
/*! ZDICT_trainFromBuffer():
* Train a dictionary from an array of samples.
* Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
* Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
* supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
* The resulting dictionary will be saved into `dictBuffer`.
* @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
* or an error code, which can be tested with ZDICT_isError().
* Note: Dictionary training will fail if there are not enough samples to construct a
* dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
* If dictionary training fails, you should use zstd without a dictionary, as the dictionary
* would have been ineffective anyway. If you believe your samples would benefit from a dictionary,
* please open an issue with details, and we can look into it.
* Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
* Tips: In general, a reasonable dictionary has a size of ~100 KB.
* It's possible to select a smaller or larger size just by specifying `dictBufferCapacity`.
* In general, it's recommended to provide a few thousand samples, though this can vary a lot.
* It's recommended that the total size of all samples be about 100 times the target dictionary size.
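The input layout and sizing guidance above can be prepared and sanity-checked in plain Lisp before calling the trainer. The sketch below is illustrative only: `pack-samples` and `sample-set-summary` are hypothetical helpers, samples are assumed to be octet vectors, and copying the result into alien memory is left to the caller.

```lisp
(defun pack-samples (samples)
  "Concatenate SAMPLES (a list of octet vectors) into one flat buffer.
Returns two values: the flat buffer and a vector of sample sizes, in order."
  (let* ((sizes (map 'vector #'length samples))
         (flat (make-array (reduce #'+ sizes)
                           :element-type '(unsigned-byte 8)))
         (pos 0))
    (dolist (sample samples)
      (replace flat sample :start1 pos)
      (incf pos (length sample)))
    (values flat sizes)))

(defun sample-set-summary (samples dict-capacity)
  "Summarize SAMPLES against the guidance quoted above: the total size,
the number of samples under 8 bytes, and whether the total reaches
roughly 100 times DICT-CAPACITY."
  (let ((total (reduce #'+ samples :key #'length))
        (tiny (count-if (lambda (s) (< (length s) 8)) samples)))
    (list :total-bytes total
          :tiny-samples tiny
          :meets-100x-guideline (>= total (* 100 dict-capacity)))))
```

The two values from `pack-samples` correspond to `samplesBuffer` and `samplesSizes`; the summary is advisory, since the trainer itself decides whether the samples are sufficient.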
(deferror zstd-ddict-error (zstd-alien-error) ())
(deferror zstd-cdict-error (zstd-alien-error)
  (:report (lambda (c s)
             (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))
(define-alien-enum (zstd-dict-content-type int)
(define-alien-enum (zstd-dict-load-method int)
(define-alien-enum (zstd-force-ignore-checksum int)
(define-alien-enum (zstd-ref-multiple-ddicts int)
  :ref-multiple-ddicts 1)
(define-alien-enum (zstd-dict-attach-pref int)
(define-alien-enum (zstd-literal-compression-mode int)
(define-alien-enum (zstd-param-switch int)
(define-alien-enum (zstd-frame-type int)
(define-alien-enum (zstd-sequence-format int)
  :no-block-delimiters 0
  :explicit-block-delimiters 1)
;;; Simple Dictionary API
(defar "ZSTD_compress_usingDict" size-t
  (dst-capacity size-t)
  (compression-level int))
(defar "ZSTD_decompress_usingDict" size-t
  (dst-capacity size-t)
;;; Bulk-processing Dictionary API
(define-alien-type zstd-cdict (struct zstd-cdict-s))
(defar "ZSTD_createCDict" (* zstd-cdict)
  (compression-level int))
(defar "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))
(defar "ZSTD_compress_usingCDict" size-t
  (dst-capacity size-t)
  (cdict (* zstd-cdict)))
(define-alien-type zstd-ddict (struct zstd-ddict-s))
(defar "ZSTD_createDDict" (* zstd-ddict)
(defar "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))
(defar "ZSTD_decompress_usingDDict" size-t
  (dst-capacity size-t)
  (ddict (* zstd-ddict)))
(defar "ZSTD_getDictID_fromDict" unsigned
(defar "ZSTD_getDictID_fromCDict" unsigned
  (cdict (* zstd-cdict)))
(defar "ZSTD_getDictID_fromDDict" unsigned
  (ddict (* zstd-ddict)))
(defar "ZSTD_getDictID_fromFrame" unsigned
(defar "ZSTD_estimatedDictSize" size-t (dict-size size-t) (dict-load-method zstd-dict-load-method))
(defmacro with-zstd-cdict ((cv &key buffer size (level (zstd-defaultclevel))) &body body)
  `(with-alien ((,cv (* zstd-cdict) (zstd-createcdict (cast (octets-to-alien ,buffer) (* t))
                                                      (or ,size (length ,buffer))
     (unwind-protect (progn ,@body)
       (zstd-freecdict ,cv))))
(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  `(with-alien ((,dv (* zstd-ddict)
                     (zstd-createddict (cast (octets-to-alien ,buffer) (* t)) (or ,size (length ,buffer)))))
     (unwind-protect (progn ,@body)
       (zstd-freeddict ,dv))))
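Both macros follow the usual create / `unwind-protect` / free shape, so the dictionary is released even when the body exits non-locally. The toy sketch below shows that behaviour without touching libzstd; `create-handle`, `free-handle`, `with-handle`, `demo-early-exit`, and `*freed*` are stand-ins invented for illustration and are not part of this file.

```lisp
(defvar *freed* '()
  "Names of handles released so far, most recent first.")

(defun create-handle (name)
  "Stand-in for a constructor such as a CDict or DDict creator."
  (list :handle name))

(defun free-handle (handle)
  "Stand-in for the matching free function; records the release."
  (push (second handle) *freed*))

(defmacro with-handle ((var name) &body body)
  "Bind VAR to a fresh handle for BODY and always release it afterwards."
  `(let ((,var (create-handle ,name)))
     (unwind-protect (progn ,@body)
       (free-handle ,var))))

(defun demo-early-exit ()
  "Leave the body through THROW and report which handles were released."
  (setf *freed* '())
  (catch 'bail
    (with-handle (h :dict-a)
      (throw 'bail (second h))))
  *freed*)
```

Calling `(demo-early-exit)` shows the cleanup form running despite the `throw`, which is the guarantee the two dictionary macros rely on.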
(define-alien-type zstd-cover-params
  (struct zdict-cover-params
    (nb-threads unsigned)
    (shrink-dict unsigned)
    (shrink-dict-max-regression unsigned)
    (zparams zdict-params)))
(defar ("ZDICT_trainFromBuffer" zdict-train-from-buffer) size-t
  (dict-buffer-capacity size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned))
;; NOTE: Requires returning struct by value
;; This is the ONLY function that uses libzstd-alien.so right now.
(defar ("ZDICT_finalizeDictionaryWithParams" zdict-finalize-dictionary) size-t
  (dst-dict-buffer (* t))
  (max-dict-size size-t)
  (dict-content-size size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned)
  (parameters (* zdict-params)))
(defar ("ZDICT_getDictID" zdict-get-dict-id) unsigned
(defar ("ZDICT_getDictHeaderSize" zdict-get-dict-header-size) size-t
(defar ("ZDICT_isError" zdict-is-error) unsigned
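Because `ZDICT_isError` is declared here with an `unsigned` result, the Lisp binding presumably yields an integer, and 0 counts as true in Lisp. A caller therefore has to compare the result against zero instead of using it directly as a boolean. The sketch below shows one way to wrap that check; the condition `zdict-result-error` and the function `check-zdict-result` are hypothetical, and the predicate is passed in as an argument so the example runs without libzstd loaded.

```lisp
(define-condition zdict-result-error (error)
  ((code :initarg :code :reader zdict-result-error-code))
  (:report (lambda (c s)
             (format s "ZDICT call failed with code ~D"
                     (zdict-result-error-code c)))))

(defun check-zdict-result (result is-error-fn)
  "Return RESULT unchanged unless IS-ERROR-FN, a C-style predicate that
returns a nonzero integer on failure, flags it; then signal an error."
  (if (plusp (funcall is-error-fn result))
      (error 'zdict-result-error :code result)
      result))
```

With the real bindings the call might look like `(check-zdict-result (zdict-train-from-buffer ...) #'zdict-is-error)`, assuming `defar` defines ordinary functions under those names.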