LMK > CLS: Landmark Pooling for Dense Embeddings
January 29, 2026
Authors: Meet Doshi, Aashka Trivedi, Vishwajeet Kumar, Parul Awasthy, Yulong Li, Jaydeep Sen, Radu Florian, Sachindra Joshi
cs.AI
Abstract
Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
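The abstract describes the mechanism only at a high level. The following is a minimal sketch of the chunk-and-landmark idea as stated there, not the authors' reference implementation: the [LMK] token id, chunk size, and helper names are illustrative assumptions, and a real encoder would need the landmark token added to its vocabulary and seen during training.

```python
# Sketch of Landmark (LMK) pooling: split the token sequence into chunks,
# append an [LMK] token after each chunk, encode, then mean-pool the
# hidden states at the landmark positions. Hypothetical constants below.
import torch

LMK_ID = 30522      # hypothetical vocabulary id of the special [LMK] token
CHUNK_SIZE = 128    # hypothetical chunk length in tokens

def insert_landmarks(input_ids: torch.Tensor) -> torch.Tensor:
    """Partition a 1-D id sequence into fixed-size chunks and insert an
    [LMK] token after each chunk."""
    lmk = torch.tensor([LMK_ID], dtype=input_ids.dtype)
    chunks = input_ids.split(CHUNK_SIZE)
    return torch.cat([torch.cat([chunk, lmk]) for chunk in chunks])

def lmk_pool(hidden_states: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Mean-pool the encoder hidden states at landmark positions.
    hidden_states: (seq_len, dim); input_ids: (seq_len,)."""
    mask = input_ids == LMK_ID
    return hidden_states[mask].mean(dim=0)

# Usage (assuming a HuggingFace-style encoder and tokenizer):
#   ids = insert_landmarks(token_ids)
#   hidden = encoder(ids.unsqueeze(0)).last_hidden_state[0]
#   embedding = lmk_pool(hidden, ids)
```

One design point worth noting: because each landmark summarizes only its local chunk, averaging over landmarks preserves chunk-level salience (unlike mean pooling over all tokens) while still covering the whole sequence (unlike a single [CLS] token).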