Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Abstract

Clark Hash is a small method for storing neural embeddings compactly. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of the 1536 bytes needed for dense f32 storage, a 32x reduction. The method needs no training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached macro Pearson correlations of 0.910 and 0.946 with dense cosine scores on STS17 and STS22, respectively. Clark Hash is not a new Johnson-Lindenstrauss theorem, and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
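To make the pipeline concrete, the following is a minimal Rust sketch of the four steps named above (normalize, sparse signed projection, clip, scalar-quantize), together with floating-point query scoring. It is an illustration under stated assumptions, not the Clark Hash implementation. The abstract fixes only the 48-byte total for 384-dimensional input; the split into 96 projected coordinates at 4 bits each, the 4 nonzeros per input coordinate, the clipping threshold of 0.25, the SplitMix64-style hash, and all function names are hypothetical choices made here. Projecting the query with the same hashed matrix before scoring is likewise an assumption.

```rust
// All constants below are assumptions for illustration only.
const OUT_DIM: usize = 96; // projected coordinates (assumed)
const BITS: u32 = 4; // bits per stored coordinate (assumed)
const NNZ: usize = 4; // nonzeros per input coordinate (assumed)
const CLIP: f32 = 0.25; // clipping threshold (assumed)
const CODE_BYTES: usize = OUT_DIM * (BITS as usize) / 8; // 96 * 4 / 8 = 48

/// SplitMix64-style mixer. Hashing indices means no projection matrix
/// and no per-corpus state has to be stored.
fn mix(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D049BB133111EB);
    x ^ (x >> 31)
}

/// L2-normalize `v`, then add each coordinate with a hashed sign into
/// NNZ hashed output buckets (a sparse signed random projection).
fn project(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(f32::MIN_POSITIVE);
    // The 1/sqrt(NNZ) factor keeps the expected squared norm at 1.
    let scale = 1.0 / (norm * (NNZ as f32).sqrt());
    let mut out = vec![0.0f32; OUT_DIM];
    for (i, &x) in v.iter().enumerate() {
        for k in 0..NNZ {
            let h = mix((i as u64) * (NNZ as u64) + (k as u64));
            let bucket = (h % (OUT_DIM as u64)) as usize;
            let sign: f32 = if (h >> 63) == 1 { -1.0 } else { 1.0 };
            out[bucket] += sign * x * scale;
        }
    }
    out
}

/// Clip each projected coordinate to [-CLIP, CLIP], quantize it to a
/// 4-bit level, and pack two levels per byte.
fn encode(v: &[f32]) -> [u8; CODE_BYTES] {
    let levels = ((1u32 << BITS) - 1) as f32; // 15 steps between 16 levels
    let mut code = [0u8; CODE_BYTES];
    for (j, &p) in project(v).iter().enumerate() {
        let c = p.clamp(-CLIP, CLIP);
        let q = (((c + CLIP) / (2.0 * CLIP)) * levels).round() as u8;
        code[j / 2] |= q << (4 * (j % 2));
    }
    code
}

/// Score a floating-point query against a stored code: project the
/// query without quantizing it, dequantize the code, take the dot product.
fn score(query: &[f32], code: &[u8; CODE_BYTES]) -> f32 {
    let levels = ((1u32 << BITS) - 1) as f32;
    let mut dot = 0.0f32;
    for (j, &qj) in project(query).iter().enumerate() {
        let nibble = (code[j / 2] >> (4 * (j % 2))) & 0x0F;
        let value = (nibble as f32 / levels) * 2.0 * CLIP - CLIP;
        dot += qj * value;
    }
    dot
}
```

Calling `encode` on a 384-dimensional `&[f32]` yields a `[u8; 48]`, and `score(&query, &code)` returns an approximate, uncalibrated similarity. The size arithmetic matches the figures in the abstract: 384 f32 values occupy 384 × 4 = 1536 bytes, and 1536 / 48 = 32. Any other combination with `OUT_DIM * BITS = 384` bits would also give 48 bytes, so the 96 × 4 split is only one possibility. Because the projection depends only on hashed indices, a vector can be encoded without ever looking at the rest of the corpus.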