LMK > CLS: Landmark Pooling for Dense Embeddings

Abstract

Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence into a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing salient local features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
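The chunk, insert, and pool steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `insert_landmarks`, `lmk_pool`, `LMK_ID`, and `chunk_size` are hypothetical, the hidden states are a toy stand-in for encoder outputs, and the sketch assumes a landmark is also placed after the final chunk so that short sequences still yield at least one landmark (the abstract only says landmarks go between chunks).

```python
from typing import List, Sequence, Tuple

LMK_ID = -1  # hypothetical landmark token id; a real tokenizer would reserve one


def insert_landmarks(token_ids: Sequence[int], chunk_size: int,
                     lmk_id: int = LMK_ID) -> Tuple[List[int], List[int]]:
    """Split token_ids into chunks and append a landmark after each chunk.

    Assumption: a landmark also follows the last chunk.
    Returns the augmented sequence and the landmark positions within it.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    augmented: List[int] = []
    positions: List[int] = []
    for start in range(0, len(token_ids), chunk_size):
        augmented.extend(token_ids[start:start + chunk_size])
        positions.append(len(augmented))  # index the landmark is about to occupy
        augmented.append(lmk_id)
    return augmented, positions


def lmk_pool(hidden_states: Sequence[Sequence[float]],
             positions: Sequence[int]) -> List[float]:
    """Mean-pool the hidden states found at the landmark positions."""
    if not positions:
        raise ValueError("no landmark positions to pool")
    rows = [hidden_states[p] for p in positions]
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy usage: six tokens, chunks of two, so three landmarks are inserted.
ids, pos = insert_landmarks([101, 7, 8, 9, 10, 11], chunk_size=2)
# Stand-in for encoder output: one 2-d vector per position of the augmented sequence.
hidden = [[float(i), 1.0] for i in range(len(ids))]
embedding = lmk_pool(hidden, pos)  # averages rows 2, 5, and 8 -> [5.0, 1.0]
```

In this sketch the final embedding draws on one vector per chunk, whereas [CLS] pooling would read a single position and mean pooling would average every token position.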
