Your UnEmbedding Matrix Is Actually a Feature Lens for Text Embeddings

Abstract

Large language models (LLMs) exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause of this deficiency. Our motivation stems from an unexpected observation: when projected onto the vocabulary space, text embeddings tend to align with frequent but uninformative tokens. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to directly refine text embeddings derived from LLMs. Specifically, we find that the unembedding matrix within LLMs encodes a latent subspace that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs for improving text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
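To make the idea of filtering a subspace with a single linear transformation concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not say which directions are removed or how they are chosen, so this sketch assumes, purely for illustration, that the subspace is spanned by the top-k right-singular directions of the unembedding matrix. The names `build_filter`, `apply_filter`, and `k`, as well as the random matrices standing in for real model weights and embeddings, are hypothetical.

```python
import numpy as np


def build_filter(unembed: np.ndarray, k: int) -> np.ndarray:
    """Build a (hidden_dim, hidden_dim - k) linear map that drops k directions.

    ASSUMPTION: the directions to drop are the top-k right-singular vectors
    of the unembedding matrix (shape: vocab_size x hidden_dim). The paper may
    select the subspace differently. Requires vocab_size >= hidden_dim.
    """
    # Rows of vt are orthonormal directions in hidden space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(unembed, full_matrices=False)
    # Keep only the remaining directions; transposing gives a projection
    # onto a (hidden_dim - k)-dimensional space.
    return vt[k:].T


def apply_filter(embeddings: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Project embeddings through the filter and L2-normalize each row."""
    reduced = embeddings @ filt
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)


# Toy stand-ins for real model weights and text embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
vocab_size, hidden_dim, k = 50, 8, 2
unembed = rng.normal(size=(vocab_size, hidden_dim))
embeddings = rng.normal(size=(4, hidden_dim))

filt = build_filter(unembed, k)
refined = apply_filter(embeddings, filt)
```

Because the filter has fewer columns than rows, the refined embeddings are also lower-dimensional (here `hidden_dim - k`), which mirrors the dimensionality reduction the abstract describes as a byproduct.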