Your Unembedding Matrix Is Effectively a Feature Lens for Text Embeddings

Abstract

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause of this deficiency. Our motivation stems from an unexpected observation: when projected onto the vocabulary space, text embeddings tend to align with frequent but uninformative tokens. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived directly from LLMs. Specifically, we find that the unembedding matrix within LLMs encodes a latent space that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs for improving text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
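To make the two operations named in the abstract concrete, the following is a minimal sketch of (i) projecting an embedding onto the vocabulary space and (ii) applying a linear filter that removes a subspace and thereby shrinks the embedding. It is not the released implementation. The abstract does not say how the filtered subspace is chosen, so the sketch assumes, purely for illustration, that it is spanned by the top right singular vectors of the unembedding matrix. The names `vocab_logits`, `build_filter`, and `num_removed`, as well as the random toy matrix, are hypothetical.

```python
import numpy as np


def vocab_logits(embedding, unembedding):
    """Project an embedding of shape (dim,) onto the vocabulary space.

    `unembedding` has shape (vocab_size, dim); the result has one score per token.
    """
    return unembedding @ embedding


def build_filter(unembedding, num_removed):
    """Return a (dim - num_removed, dim) linear map that drops a subspace.

    ASSUMPTION (not stated in the abstract): the subspace to remove is spanned
    by the top `num_removed` right singular vectors of the unembedding matrix.
    Requires vocab_size >= dim so that the rows of `vt` span the full space.
    """
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    # Rows of vt are orthonormal directions sorted by singular value.
    # Keeping only the trailing rows both filters and reduces the dimension.
    return vt[num_removed:]


# Toy data standing in for a real LLM unembedding matrix (hypothetical).
rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
W = rng.normal(size=(vocab_size, dim))
W[:3] *= 10.0  # pretend tokens 0-2 are frequent tokens with dominant rows

F = build_filter(W, num_removed=3)
x = rng.normal(size=dim)       # stand-in for a text embedding
x_reduced = F @ x              # filtered embedding, now of size dim - 3

print(vocab_logits(x, W).shape, x.shape, x_reduced.shape)
```

Because the filter is a single matrix, it can be applied to a whole index of embeddings with one matrix product, which is where the storage and retrieval savings mentioned in the abstract would come from under these assumptions.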