Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Abstract

Clark Hash is a lightweight method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored directly against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes, compared with 1536 bytes for dense f32 storage, a 32x reduction. The method needs no training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, its Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached macro Pearson correlations of 0.910 and 0.946 with dense cosine scores on STS17 and STS22, respectively. Clark Hash is not a new Johnson-Lindenstrauss theorem, nor is it a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
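To make the pipeline concrete, the following is a minimal Rust sketch of the steps named in the abstract: normalize, sparse signed projection, clip, scalar-quantize, and score a floating-point query against a stored code. It is an illustration under stated assumptions, not the reference implementation. The abstract fixes only the 384-dimensional input and the 48-byte output; the 96-dimensional projection, 4-bit codes, 16 nonzeros per row, clip value, seed, SplitMix64-style hash, and all function names here are hypothetical choices made so the arithmetic works out to 48 bytes.

```rust
// Hypothetical parameters (not taken from the paper):
// 96 codes x 4 bits = 384 bits = 48 bytes, versus 384 x 4 = 1536 bytes for f32.
const IN_DIM: usize = 384;
const OUT_DIM: usize = 96;
const NNZ: usize = 16; // assumed nonzeros per projection row
const CLIP: f32 = 3.0; // assumed clip range, in units of expected std
const SEED: u64 = 0xC1A2_4B5D_0000_0001; // arbitrary fixed seed

/// SplitMix64-style mixer, used as a stateless hash of (seed, row, slot).
fn mix(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// L2-normalize and apply the sparse signed projection.
/// The matrix is never stored: entries are re-derived from the hash.
fn project(v: &[f32]) -> Vec<f32> {
    assert_eq!(v.len(), IN_DIM);
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(f32::MIN_POSITIVE);
    // Normalization is folded into the scale; sqrt(IN_DIM / NNZ) gives each
    // output coordinate roughly unit variance for a unit-norm input.
    let scale = (IN_DIM as f32 / NNZ as f32).sqrt() / norm;
    (0..OUT_DIM)
        .map(|row| {
            let mut acc = 0.0f32;
            for slot in 0..NNZ {
                let h = mix(SEED ^ (row * NNZ + slot) as u64);
                let idx = (h % IN_DIM as u64) as usize;
                let sign = if h >> 63 == 1 { 1.0 } else { -1.0 };
                acc += sign * v[idx];
            }
            acc * scale
        })
        .collect()
}

/// Clip each projected coordinate and pack two 4-bit levels per byte.
fn encode(v: &[f32]) -> [u8; OUT_DIM / 2] {
    let y = project(v);
    let mut out = [0u8; OUT_DIM / 2];
    for (j, &yj) in y.iter().enumerate() {
        let t = (yj.clamp(-CLIP, CLIP) + CLIP) / (2.0 * CLIP); // in [0, 1]
        let level = (t * 15.0).round() as u8; // 0..=15
        out[j / 2] |= level << (4 * (j % 2));
    }
    out
}

/// Score a floating-point query against a stored code.
/// The mean of coordinate products approximates cosine similarity.
fn score(query: &[f32], code: &[u8; OUT_DIM / 2]) -> f32 {
    let q = project(query);
    let mut dot = 0.0f32;
    for (j, &qj) in q.iter().enumerate() {
        let level = (code[j / 2] >> (4 * (j % 2))) & 0x0F;
        let yj = level as f32 / 15.0 * (2.0 * CLIP) - CLIP; // dequantize
        dot += qj * yj;
    }
    dot / OUT_DIM as f32
}
```

Calling `encode(&embedding)` on a 384-element slice yields a 48-byte array, and `score(&query, &code)` returns an approximate cosine similarity. Because the projection is derived from a fixed seed and hash rather than from data, any vector can be encoded at any time without consulting the rest of the corpus, which is the stateless property the abstract emphasizes. Other splits of the 48-byte budget are equally consistent with the abstract.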