LMK > CLS: Landmark Pooling for Dense Embeddings

January 29, 2026
Authors: Meet Doshi, Aashka Trivedi, Vishwajeet Kumar, Parul Awasthy, Yulong Li, Jaydeep Sen, Radu Florian, Sachindra Joshi
cs.AI

Abstract

Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.