Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Abstract

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes, 32x smaller than the 1536 bytes of dense f32 storage. The method needs no training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached macro Pearson correlations with dense cosine scores of 0.910 on STS17 and 0.946 on STS22. Clark Hash is neither a new Johnson-Lindenstrauss theorem nor a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
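To make the pipeline concrete, the following is a minimal Rust sketch of the four steps (normalize, sparse signed projection, clip, scalar-quantize) and of scoring a floating-point query against a stored sketch. It is an illustration under stated assumptions, not the Clark Hash implementation. The abstract fixes only the 384-dimensional input and the 48-byte code, so every other choice here is hypothetical: 96 projected coordinates at 4 bits each, 4 nonzeros per input coordinate, a clip threshold of 0.3, a SplitMix64-style mixer as the deterministic hash, and the names `mix`, `project`, `encode`, and `score`.

```rust
// All constants below are assumptions for illustration; only the 48-byte
// code size comes from the text.
const OUT_DIM: usize = 96; // projected coordinates (assumed)
const NNZ: usize = 4; // nonzeros per input coordinate (assumed)
const CLIP: f32 = 0.3; // clipping threshold (assumed)
const CODE_BYTES: usize = OUT_DIM / 2; // two 4-bit codes per byte = 48 bytes

// SplitMix64-style mixer, used as a stand-in deterministic hash.
// No seed table or learned state is stored anywhere.
fn mix(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

// L2-normalize `v`, then apply a sparse signed projection whose nonzero
// positions and signs are derived from hashes of the coordinate index.
fn project(v: &[f32]) -> [f32; OUT_DIM] {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(f32::MIN_POSITIVE);
    let scale = 1.0 / (norm * (NNZ as f32).sqrt());
    let mut out = [0.0f32; OUT_DIM];
    for (i, &x) in v.iter().enumerate() {
        for k in 0..NNZ {
            let h = mix((i * NNZ + k) as u64);
            let row = ((h >> 1) % OUT_DIM as u64) as usize;
            let sign = if h & 1 == 0 { 1.0 } else { -1.0 };
            out[row] += sign * x * scale;
        }
    }
    out
}

// Database side: project, clip to [-CLIP, CLIP], quantize each coordinate
// to one of 16 uniform levels, and pack two codes per byte.
fn encode(v: &[f32]) -> [u8; CODE_BYTES] {
    let p = project(v);
    let mut code = [0u8; CODE_BYTES];
    for (j, &y) in p.iter().enumerate() {
        let t = (y.clamp(-CLIP, CLIP) + CLIP) / (2.0 * CLIP); // in [0, 1]
        let q = (t * 15.0).round() as u8; // in 0..=15
        code[j / 2] |= q << ((j % 2) * 4);
    }
    code
}

// Query side: the query stays in f32, is projected the same way, and is
// dotted with the dequantized sketch to estimate cosine similarity.
fn score(query: &[f32], code: &[u8; CODE_BYTES]) -> f32 {
    let q = project(query);
    let mut dot = 0.0f32;
    for (j, &qj) in q.iter().enumerate() {
        let nibble = (code[j / 2] >> ((j % 2) * 4)) & 0x0F;
        let y = nibble as f32 / 15.0 * (2.0 * CLIP) - CLIP;
        dot += qj * y;
    }
    dot
}
```

Under these assumptions, `encode(&embedding)` returns the 48-byte sketch to store, and `score(&query, &sketch)` returns an estimate of the cosine similarity between the query and the original vector. The storage arithmetic matches the text: 384 f32 values occupy 384 × 4 = 1536 bytes, and 1536 / 48 = 32. Because the projection depends only on coordinate indices, a new vector can be encoded without consulting any previously stored data.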