Your Unembedding Matrix Is Actually a Feature Lens for Text Embeddings

Abstract

Large language models (LLMs) exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause of this deficiency. Our motivation stems from an unexpected observation: when projected onto the vocabulary space, text embeddings tend to align with frequent but uninformative tokens. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to directly refine text embeddings derived from LLMs. Specifically, we find that the unembedding matrix within LLMs encodes a latent subspace that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage costs and speeding up retrieval while fully preserving the quality of the refined embeddings. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs for text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
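
The abstract does not say how the filtered subspace is chosen, so the following is only an illustrative sketch of what "a linear transformation that filters a subspace of the unembedding matrix and reduces dimensionality" could look like. It assumes, purely for illustration, that the subspace to remove is spanned by the top right singular vectors of the unembedding matrix; the function names `build_filter` and `apply_filter`, the parameter `num_removed`, and the random stand-in data are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def build_filter(unembedding: np.ndarray, num_removed: int) -> np.ndarray:
    """Build a (hidden_dim, hidden_dim - num_removed) projection matrix.

    ASSUMPTION (not from the paper): the subspace to filter out is spanned by
    the top `num_removed` right singular vectors of the unembedding matrix,
    which has shape (vocab_size, hidden_dim) with vocab_size >= hidden_dim.
    """
    # Rows of vt are orthonormal directions in hidden space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    kept_directions = vt[num_removed:]  # drop the leading directions
    return kept_directions.T


def apply_filter(embeddings: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Project embeddings onto the kept directions and L2-normalize each row."""
    filtered = embeddings @ projection
    norms = np.linalg.norm(filtered, axis=1, keepdims=True)
    return filtered / np.clip(norms, 1e-12, None)


# Toy usage with random stand-ins for a real unembedding matrix and embeddings.
rng = np.random.default_rng(0)
toy_unembedding = rng.normal(size=(50, 8))  # (vocab_size, hidden_dim)
toy_embeddings = rng.normal(size=(4, 8))    # (num_texts, hidden_dim)

toy_projection = build_filter(toy_unembedding, num_removed=3)
toy_filtered = apply_filter(toy_embeddings, toy_projection)
print(toy_filtered.shape)
```

Because the projection drops `num_removed` directions, the filtered embeddings are shorter than the originals, which is the kind of dimensionality reduction the abstract describes. Which directions the actual method removes, and how many, should be taken from the paper and the linked repository rather than from this sketch.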