Coverage report: /home/ellis/comp/core/ffi/zstd/dict.lisp

Kind	Covered	All	%
expression	5	44	11.4
branch	0	0	nil

Key: Not instrumented | Conditionalized out | Executed | Not executed | Both branches taken | One branch taken | Neither branch taken

1

;;; dict.lisp --- Zstd Dictionary API

2

3

;;

4

5

;;; Commentary:

6

7

;; The CDict can be created once and shared across multiple threads since it's

8

;; read-only.

9

10

;; Unclear if DDict is also read-only.

11

12

;; From zdict.h:

13

#|

14

* Zstd dictionary builder

15

*

16

* FAQ

17

* ===

18

* Why should I use a dictionary?

19

* ------------------------------

20

*

21

* Zstd can use dictionaries to improve compression ratio of small data.

22

* Traditionally small files don't compress well because there is very little

23

* repetition in a single sample, since it is small. But, if you are compressing

24

* many similar files, like a bunch of JSON records that share the same

25

* structure, you can train a dictionary ahead of time on some samples of

26

* these files. Then, zstd can use the dictionary to find repetitions that are

27

* present across samples. This can vastly improve compression ratio.

28

*

29

* When is a dictionary useful?

30

* ----------------------------

31

*

32

* Dictionaries are useful when compressing many small files that are similar.

33

* The larger a file is, the less benefit a dictionary will have. Generally,

34

* we don't expect dictionary compression to be effective past 100KB. And the

35

* smaller a file is, the more we would expect the dictionary to help.

36

*

37

* How do I use a dictionary?

38

* --------------------------

39

*

40

* Simply pass the dictionary to the zstd compressor with

41

* `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to

42

* the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other

43

* more advanced functions that allow selecting some options, see zstd.h for

44

* complete documentation.

45

*

46

* What is a zstd dictionary?

47

* --------------------------

48

*

49

* A zstd dictionary has two pieces: Its header, and its content. The header

50

* contains a magic number, the dictionary ID, and entropy tables. These

51

* entropy tables allow zstd to save on header costs in the compressed file,

52

* which really matters for small data. The content is just bytes, which are

53

* repeated content that is common across many samples.

54

*

55

* What is a raw content dictionary?

56

* ---------------------------------

57

*

58

* A raw content dictionary is just bytes. It doesn't have a zstd dictionary

59

* header, a dictionary ID, or entropy tables. Any buffer is a valid raw

60

* content dictionary.

61

*

62

* How do I train a dictionary?

63

* ----------------------------

64

*

65

* Gather samples from your use case. These samples should be similar to each

66

* other. If you have several use cases, you could try to train one dictionary

67

* per use case.

68

*

69

* Pass those samples to `ZDICT_trainFromBuffer()` and that will train your

70

* dictionary. There are a few advanced versions of this function, but this

71

* is a great starting point. If you want to further tune your dictionary

72

* you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow

73

* you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.

74

*

75

* If the dictionary training function fails, that is likely because you

76

* either passed too few samples, or a dictionary would not be effective

77

* for your data. Look at the messages that the dictionary trainer printed;

78

* if it doesn't say too few samples, then a dictionary would not be effective.

79

*

80

* How large should my dictionary be?

81

* ----------------------------------

82

*

83

* A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.

84

* The zstd CLI defaults to a 110KB dictionary. You likely don't need a

85

* dictionary larger than that. But, most use cases can get away with a

86

* smaller dictionary. The advanced dictionary builders can automatically

87

* shrink the dictionary for you, and select the smallest size that doesn't

88

* hurt compression ratio too much. See the `shrinkDict` parameter.

89

* A smaller dictionary can save memory, and potentially speed up

90

* compression.

91

*

92

* How many samples should I provide to the dictionary builder?

93

* ------------------------------------------------------------

94

*

95

* We generally recommend passing ~100x the size of the dictionary

96

* in samples. A few thousand should suffice. Having too few samples

97

* can hurt the dictionary's effectiveness. Having more samples will

98

* only improve the dictionary's effectiveness. But having too many

99

* samples can slow down the dictionary builder.

100

*

101

* How do I determine if a dictionary will be effective?

102

* -----------------------------------------------------

103

*

104

* Simply train a dictionary and try it out. You can use zstd's built-in

105

* benchmarking tool to test the dictionary's effectiveness.

106

*

107

* # Benchmark levels 1-3 without a dictionary

108

* zstd -b1e3 -r /path/to/my/files

109

* # Benchmark levels 1-3 with a dictionary

110

* zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary

111

*

112

* When should I retrain a dictionary?

113

* -----------------------------------

114

*

115

* You should retrain a dictionary when its effectiveness drops. Dictionary

116

* effectiveness drops as the data you are compressing changes. Generally, we do

117

* expect dictionaries to "decay" over time, as your data changes, but the rate

118

* at which they decay depends on your use case. Internally, we regularly

119

* retrain dictionaries, and if the new dictionary performs significantly

120

* better than the old dictionary, we will ship the new dictionary.

121

*

122

* I have a raw content dictionary, how do I turn it into a zstd dictionary?

123

* -------------------------------------------------------------------------

124

*

125

* If you have a raw content dictionary, e.g. by manually constructing it, or

126

* using a third-party dictionary builder, you can turn it into a zstd

127

* dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to

128

* provide some samples of the data. It will add the zstd header to the

129

* raw content, which contains a dictionary ID and entropy tables, which

130

* will improve compression ratio, and allow zstd to write the dictionary ID

131

* into the frame, if you so choose.

132

*

133

* Do I have to use zstd's dictionary builder?

134

* -------------------------------------------

135

*

136

* No! You can construct dictionary content however you please, it is just

137

* bytes. It will always be valid as a raw content dictionary. If you want

138

* a zstd dictionary, which can improve compression ratio, use

139

* `ZDICT_finalizeDictionary()`.

140

*

141

* What is the attack surface of a zstd dictionary?

142

* ------------------------------------------------

143

*

144

* Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so

145

* zstd should never crash, or access out-of-bounds memory no matter what

146

* the dictionary is. However, if an attacker can control the dictionary

147

* during decompression, they can cause zstd to generate arbitrary bytes,

148

* just like if they controlled the compressed data.

149

*

150

******************************************************************************/

151

152

153

/*! ZDICT_trainFromBuffer():

154

* Train a dictionary from an array of samples.

155

* Redirects to ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,

156

* f=20, and accel=1.

157

* Samples must be stored concatenated in a single flat buffer `samplesBuffer`,

158

* supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.

159

* The resulting dictionary will be saved into `dictBuffer`.

160

* @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)

161

* or an error code, which can be tested with ZDICT_isError().

162

* Note: Dictionary training will fail if there are not enough samples to construct a

163

  *         dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).

164

  *         If dictionary training fails, you should use zstd without a dictionary, as the dictionary

165

  *         would've been ineffective anyway. If you believe your samples would benefit from a dictionary,

166

* please open an issue with details, and we can look into it.

167

* Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.

168

* Tips: In general, a reasonable dictionary has a size of ~ 100 KB.

169

* It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.

170

  *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.

171

  *        It's recommended that the total size of all samples be about 100 times the target size of the dictionary.

172

*/

173

|#

174

;;; Code:

175

(in-package :zstd)

176

(deferror zstd-ddict-error (zstd-alien-error) ())

177

(deferror zstd-cdict-error (zstd-alien-error)

178

()

179

(:report (lambda (c s)

180

                (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))

181

182

(define-alien-enum (zstd-dict-content-type int)

183

:auto 0

184

:raw-content 1

185

:full-dict 2)

186

187

(define-alien-enum (zstd-dict-load-method int)

188

:by-copy 0

189

:by-ref 1)

190

191

(define-alien-enum (zstd-force-ignore-checksum int)

192

:validate-checksum 0

193

:ignore-checksum 1)

194

195

(define-alien-enum (zstd-ref-multiple-ddicts int)

196

:ref-single-ddict 0

197

:ref-multiple-ddicts 1)

198

199

(define-alien-enum (zstd-dict-attach-pref int)

200

:default-attach 0

201

:force-attach 1

202

:force-copy 2

203

:force-load 3)

204

205

(define-alien-enum (zstd-literal-compression-mode int)

206

:auto 0

207

:huffman 1

208

:uncompressed 2)

209

210

(define-alien-enum (zstd-param-switch int)

211

:auto 0

212

:enable 1

213

:disable 2)

214

215

(define-alien-enum (zstd-frame-type int)

216

:frame 0

217

:skippable-frame 1)

218

219

(define-alien-enum (zstd-sequence-format int)

220

:no-block-delimiters 0

221

:explicit-block-delimiters 1)

222

223

;;; Simple Dictionary API

224

(defar "ZSTD_compress_usingDict" size-t

225

(cctx (* zstd-cctx))

226

(dst (* t))

227

(dst-capacity size-t)

228

(src (* t))

229

(src-size size-t)

230

(dict (* t))

231

(dict-size size-t)

232

(compression-level int))

233

234

(defar "ZSTD_decompress_usingDict" size-t

235

(dctx (* zstd-dctx))

236

(dst (* t))

237

(dst-capacity size-t)

238

(src (* t))

239

(src-size size-t)

240

(dict (* t))

241

(dict-size size-t))

242

243

;;; Bulk-processing Dictionary API

244

(define-alien-type zstd-cdict (struct zstd-cdict-s))

245

246

(defar "ZSTD_createCDict" (* zstd-cdict)

247

(dict-buffer (* t))

248

(dict-size size-t)

249

(compression-level int))

250

251

(defar "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))

252

253

(defar "ZSTD_compress_usingCDict" size-t

254

(cctx (* zstd-cctx))

255

(dst (* t))

256

(dst-capacity size-t)

257

(src (* t))

258

(src-size size-t)

259

(cdict (* zstd-cdict)))

260

261

(define-alien-type zstd-ddict (struct zstd-ddict-s))

262

263

(defar "ZSTD_createDDict" (* zstd-ddict)

264

(dict-buffer (* t))

265

(dict-size size-t))

266

267

(defar "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))

268

269

(defar "ZSTD_decompress_usingDDict" size-t

270

(dctx (* zstd-dctx))

271

(dst (* t))

272

(dst-capacity size-t)

273

(src (* t))

274

(src-size size-t)

275

(ddict (* zstd-ddict)))

276

277

;;; Dictionary utilities

278

(defar "ZSTD_getDictID_fromDict" unsigned

279

(dict (* t))

280

(dict-size size-t))
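
;; A minimal pure-Lisp sketch of the check that ZSTD_getDictID_fromDict is
;; documented to perform, for inspecting a dictionary buffer without calling
;; into libzstd. It relies on the Zstandard format specification: a conformant
;; dictionary starts with the little-endian magic number #xEC30A437, followed
;; by a 4-byte little-endian dictionary ID. A result of 0 means the buffer can
;; only be used as a raw content dictionary. The names +DICT-MAGIC+, READ-LE32
;; and DICT-ID-FROM-OCTETS are hypothetical and not part of this file.
(defconstant +dict-magic+ #xEC30A437)

(defun read-le32 (octets offset)
  "Read an unsigned 32-bit little-endian integer from OCTETS at OFFSET."
  (logior (aref octets offset)
          (ash (aref octets (+ offset 1)) 8)
          (ash (aref octets (+ offset 2)) 16)
          (ash (aref octets (+ offset 3)) 24)))

(defun dict-id-from-octets (octets)
  "Return the dictionary ID stored in OCTETS, or 0 if OCTETS lacks a zstd header."
  (if (and (>= (length octets) 8)
           (= (read-le32 octets 0) +dict-magic+))
      (read-le32 octets 4)
      0))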

281

282

(defar "ZSTD_getDictID_fromCDict" unsigned

283

(cdict (* zstd-cdict)))

284

285

(defar "ZSTD_getDictID_fromDDict" unsigned

286

(ddict (* zstd-ddict)))

287

288

(defar "ZSTD_getDictID_fromFrame" unsigned

289

(src (* t))

290

(src-size size-t))

291

292

(defar "ZSTD_estimatedDictSize" size-t (dict-size size-t) (dict-load-method zstd-dict-load-method))

293

294

(defmacro with-zstd-cdict ((cv &key buffer size (level (zstd-defaultclevel))) &body body)
  ;; Bind BUFFER once so the form is not evaluated twice.
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,cv (* zstd-cdict)
                         (zstd-createcdict (cast (octets-to-alien ,buf) (* t))
                                           (or ,size (length ,buf))
                                           ,level)))
         (unwind-protect (progn ,@body)
           (zstd-freecdict ,cv))))))

300

301

(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  ;; Bind BUFFER once so the form is not evaluated twice.
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,dv (* zstd-ddict)
                         (zstd-createddict (cast (octets-to-alien ,buf) (* t))
                                           (or ,size (length ,buf)))))
         (unwind-protect (progn ,@body)
           (zstd-freeddict ,dv))))))

306

307

;;; zdict.h

308

(define-alien-type zstd-cover-params

309

(struct zdict-cover-params

310

(k unsigned)

311

(d unsigned)

312

(steps unsigned)

313

(nb-threads unsigned)

314

(split-point double)

315

(shrink-dict unsigned)

316

(shrink-dict-max-regression unsigned)

317

(zparams zdict-params)))

318

319

(defar ("ZDICT_trainFromBuffer" zdict-train-from-buffer) size-t

320

(dict-buffer (* t))

321

(dict-buffer-capacity size-t)

322

(samples-buffer (* t))

323

(samples-sizes (* size-t))

324

(nb-samples unsigned))
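
;; A minimal sketch of the input layout described in the zdict.h commentary
;; above: samples are stored concatenated in one flat buffer, accompanied by the
;; size of each sample in order. It only prepares Lisp-side data; copying it
;; into alien memory is left out. PACK-SAMPLES and RECOMMENDED-SAMPLE-BYTES are
;; hypothetical helper names, not part of this file.
(defun pack-samples (samples)
  "Concatenate SAMPLES, a list of octet vectors, into one flat octet vector.
Return the flat vector and the list of sample sizes, in order."
  (let* ((sizes (mapcar #'length samples))
         (flat (make-array (reduce #'+ sizes) :element-type '(unsigned-byte 8)))
         (offset 0))
    (dolist (sample samples)
      (replace flat sample :start1 offset)
      (incf offset (length sample)))
    (values flat sizes)))

(defun recommended-sample-bytes (dict-buffer-capacity)
  "Total sample size suggested by the commentary: about 100x the dictionary size."
  (* 100 dict-buffer-capacity))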

325

326

;; NOTE: Upstream ZDICT_finalizeDictionary() takes its ZDICT_params_t argument by value.

327

328

;; This is the ONLY function which uses libzstd-alien.so right now.

329

(defar ("ZDICT_finalizeDictionaryWithParams" zdict-finalize-dictionary) size-t

330

(dst-dict-buffer (* t))

331

(max-dict-size size-t)

332

(dict-content (* t))

333

(dict-content-size size-t)

334

(samples-buffer (* t))

335

(samples-sizes (* size-t))

336

(nb-samples unsigned)

337

(parameters (* zdict-params)))

338

339

(defar ("ZDICT_getDictID" zdict-get-dict-id) unsigned

340

(dict-buffer (* t))

341

(dict-size size-t))

342

343

(defar ("ZDICT_getDictHeaderSize" zdict-get-dict-header-size) size-t

344

(dict-buffer (* t))

345

(dict-size size-t))

346

347

(defar ("ZDICT_isError" zdict-is-error) unsigned

348

(error-code size-t))
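
;; A minimal sketch of how a size-t result from the ZDICT functions might be
;; checked, following the zdict.h note that the return value is either a size
;; or an error code testable with ZDICT_isError(). The predicate is passed in
;; so the helper stays self-contained; CHECK-SIZE-RESULT is a hypothetical
;; name, and passing #'zdict-is-error assumes that DEFAR defines a Lisp
;; function of that name.
(defun check-size-result (code error-p)
  "Return CODE unless ERROR-P reports it as an error; otherwise signal an error.
ERROR-P is a function of one argument returning a C-style boolean integer."
  (if (zerop (funcall error-p code))
      code
      (error "zstd dictionary call failed with code ~D" code)))

;; Possible usage, under the assumption above:
;;   (check-size-result (zdict-get-dict-header-size buf size) #'zdict-is-error)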