Your Unembedding Matrix Is Actually a Feature Lens for Text Embeddings

Abstract

Large language models (LLMs) exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models and perform suboptimally on massive text embedding benchmarks. In this paper, we identify a potential cause of this deficiency. Our motivation stems from an unexpected observation: when projected onto the vocabulary space, text embeddings tend to align with frequent but uninformative tokens. We argue that this over-expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation that refines text embeddings derived directly from LLMs. Specifically, we find that the unembedding matrix within LLMs encodes a latent subspace that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insight into the mechanisms of LLM-based representations and inspire more principled designs for improving text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
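To make the idea of a linear filter derived from the unembedding matrix more concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not say how the frequent-token subspace is identified, so the sketch assumes, purely for illustration, that it is spanned by the leading right-singular vectors of the unembedding matrix. The function names (`vocab_projection`, `build_filter`, `apply_filter`), the parameter `n_remove`, and the random toy data are all hypothetical.

```python
import numpy as np


def vocab_projection(embedding, unembed, top_k=5):
    """Indices of the top_k tokens whose unembedding rows score highest
    against a single text embedding (projection onto the vocabulary space)."""
    logits = unembed @ embedding  # shape: (vocab_size,)
    return np.argsort(-logits)[:top_k]


def build_filter(unembed, n_remove):
    """Build a (hidden - n_remove, hidden) linear map.

    ASSUMPTION (not from the paper): the subspace to filter out is spanned
    by the n_remove leading right-singular vectors of the unembedding
    matrix. Also assumes vocab_size >= hidden, as is typical for LLMs.
    """
    # Rows of vt are orthonormal directions in the hidden space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(unembed, full_matrices=False)
    return vt[n_remove:]


def apply_filter(embeddings, filt):
    """Project embeddings onto the retained directions and L2-normalize.
    The output has fewer dimensions than the input."""
    reduced = embeddings @ filt.T
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)


# Toy data standing in for a real model's unembedding matrix and embeddings.
rng = np.random.default_rng(0)
vocab_size, hidden = 50, 16
unembed = rng.normal(size=(vocab_size, hidden))
embeddings = rng.normal(size=(4, hidden))

filt = build_filter(unembed, n_remove=4)
filtered = apply_filter(embeddings, filt)
top_tokens = vocab_projection(embeddings[0], unembed)
print(filtered.shape, top_tokens)
```

Because the filter is a single matrix with orthonormal rows, it can be applied once at indexing time, and the stored vectors have `hidden - n_remove` dimensions, which is where the storage and retrieval savings mentioned in the abstract would come from under this assumption.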