Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Abstract

Clark Hash is a lightweight method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of the 1536 bytes needed for dense f32 storage, a 32x reduction. The method needs no training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached macro Pearson correlations of 0.910 and 0.946 with dense cosine scores on STS17 and STS22, respectively. Clark Hash is not a new Johnson-Lindenstrauss theorem, nor is it a replacement for approximate nearest-neighbor indexes; it is a simple stateless codec for compact embedding storage.
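
To make the pipeline concrete, the following is a minimal sketch of a normalize, project, clip, and quantize codec with float-query scoring. It is an illustration under stated assumptions, not the reference implementation: the output dimension (96), code width (4 bits), sparsity (4 nonzeros per input coordinate), clipping range, seed, hash function, and the names `project`, `encode`, and `score` are all hypothetical choices. They were picked only so that the code occupies 48 bytes, and the actual Clark Hash parameters may differ.

```rust
// All constants below are illustrative assumptions, not values from the paper.
const OUT_DIM: usize = 96; // projected dimensions (assumed)
const BITS: u32 = 4; // bits per stored coordinate (assumed)
const NNZ: u64 = 4; // nonzeros per input coordinate (assumed)
const CLIP: f32 = 0.3; // values are clipped to [-CLIP, CLIP] (assumed)
const SEED: u64 = 0x0C1A_4B5A_5EED_0001; // arbitrary fixed seed (assumed)
const CODE_BYTES: usize = OUT_DIM * BITS as usize / 8; // 96 * 4 / 8 = 48
const LEVELS: f32 = ((1u32 << BITS) - 1) as f32; // 15 quantization steps

/// SplitMix64-style mixer used as a stateless hash: the projection is
/// derived from indices alone, so nothing has to be trained or stored.
fn mix(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Normalize `v` to unit length and apply a sparse signed projection:
/// each input coordinate is added, with a hashed sign, to NNZ hashed buckets.
fn project(v: &[f32]) -> [f32; OUT_DIM] {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(f32::MIN_POSITIVE);
    let scale = 1.0 / (norm * (NNZ as f32).sqrt());
    let mut out = [0.0f32; OUT_DIM];
    for (i, &x) in v.iter().enumerate() {
        for k in 0..NNZ {
            let h = mix(SEED ^ (i as u64).wrapping_mul(NNZ).wrapping_add(k));
            let bucket = (h % OUT_DIM as u64) as usize; // low bits pick the bucket
            let sign = if (h >> 63) == 0 { 1.0 } else { -1.0 }; // top bit picks the sign
            out[bucket] += sign * x * scale;
        }
    }
    out
}

/// Database side: project, clip, quantize to 4 bits, pack two codes per byte.
fn encode(v: &[f32]) -> [u8; CODE_BYTES] {
    let p = project(v);
    let mut code = [0u8; CODE_BYTES];
    for (j, &y) in p.iter().enumerate() {
        let t = (y.clamp(-CLIP, CLIP) + CLIP) / (2.0 * CLIP); // map to [0, 1]
        let q = (t * LEVELS).round() as u8; // integer level in 0..=15
        code[j / 2] |= q << (4 * (j % 2));
    }
    code
}

/// Query side: the query stays in f32, is projected the same way, and is
/// dotted against the dequantized sketch to estimate cosine similarity.
fn score(query: &[f32], code: &[u8; CODE_BYTES]) -> f32 {
    let q = project(query);
    let mut dot = 0.0f32;
    for (j, &qj) in q.iter().enumerate() {
        let level = (code[j / 2] >> (4 * (j % 2))) & 0x0F;
        let y = level as f32 / LEVELS * 2.0 * CLIP - CLIP; // back to [-CLIP, CLIP]
        dot += qj * y;
    }
    dot
}
```

Under these assumptions, `encode(&embedding)` returns a 48-byte array for any input length, and `score(&query, &code)` returns an approximate cosine score. Because the bucket and sign for each coordinate are recomputed from the index and a fixed seed, a new vector can be encoded without consulting any previously stored data, which is the sense of "stateless" used here.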