Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Abstract

Clark Hash is a lightweight method for storing neural embeddings compactly. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored directly against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of the 1536 bytes needed for dense f32 storage, a 32x reduction. The method requires no training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, its Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached macro Pearson correlations of 0.910 and 0.946 with dense cosine scores on STS17 and STS22, respectively. Clark Hash is not a new Johnson-Lindenstrauss theorem, and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
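The pipeline described above (normalize, sparse signed projection, clip, scalar-quantize, then score a float query against the stored code) can be illustrated with a minimal Rust sketch. This is not the authors' implementation. The abstract does not state the output dimension, bit width, sparsity, clip range, or hash function, so every constant and name below (`OUT_DIM`, `NNZ`, `CLIP`, `SEED`, `mix64`, `project`, `encode`, `score`) is an assumption chosen for illustration. In particular, 48 coordinates at 8 bits each is just one layout that happens to total 48 bytes.

```rust
// Illustrative sketch only: all constants and function names are assumptions,
// not taken from the Clark Hash implementation.
const IN_DIM: usize = 384; // embedding dimension from the abstract
const OUT_DIM: usize = 48; // assumed: 48 coordinates x 8 bits = 48 bytes
const NNZ: usize = 4; // assumed nonzeros per input coordinate
const CLIP: f32 = 0.5; // assumed symmetric clip range
const SEED: u64 = 0x5EED_C1A4_0000_0001; // assumed fixed seed

// SplitMix64-style mixer, used as a stateless hash so that no projection
// matrix has to be stored or trained.
fn mix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// L2-normalize x, then apply a sparse signed projection: each input
// coordinate is added, with a hashed sign, to NNZ hashed output buckets.
fn project(x: &[f32]) -> [f32; OUT_DIM] {
    assert_eq!(x.len(), IN_DIM);
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    let inv = if norm > 0.0 { 1.0 / norm } else { 0.0 };
    let scale = inv / (NNZ as f32).sqrt();
    let mut y = [0.0f32; OUT_DIM];
    for (i, &v) in x.iter().enumerate() {
        for k in 0..NNZ {
            let h = mix64(SEED ^ ((i as u64) << 8) ^ (k as u64));
            let bucket = (h % (OUT_DIM as u64)) as usize;
            let sign = if h >> 63 == 1 { -1.0f32 } else { 1.0f32 };
            y[bucket] += sign * v * scale;
        }
    }
    y
}

// Database side: project, clip to [-CLIP, CLIP], quantize to signed 8-bit.
fn encode(x: &[f32]) -> [i8; OUT_DIM] {
    let y = project(x);
    let mut code = [0i8; OUT_DIM];
    for (c, &v) in code.iter_mut().zip(y.iter()) {
        let clipped = v.clamp(-CLIP, CLIP);
        *c = (clipped / CLIP * 127.0).round() as i8;
    }
    code
}

// Query side: the query is projected in floating point and is not quantized;
// the score is its dot product with the dequantized stored code.
fn score(query: &[f32], code: &[i8; OUT_DIM]) -> f32 {
    let q = project(query);
    let step = CLIP / 127.0;
    q.iter()
        .zip(code.iter())
        .map(|(&a, &c)| a * (c as f32) * step)
        .sum()
}
```

Because the bucket and sign for each coordinate are recomputed from a hash, `encode` depends only on the input vector and the fixed constants, which is the sense in which such a codec is stateless. Calling `encode(&v)` on a 384-element slice yields a 48-byte array, and `score(&q, &code)` returns an approximate cosine similarity whose accuracy depends on the assumed parameters.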