LMK > CLS: Landmark Pooling for Dense Embeddings

Abstract

Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence into a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
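The mechanism described in the abstract (chunk the sequence, insert landmark tokens, mean-pool only the landmark embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `LMK_ID`, `insert_landmarks`, and `landmark_pool` are hypothetical, the fixed `chunk_size` is an assumption, and the choice to place a landmark after every chunk, including the last, is also an assumption made so that short sequences still receive at least one landmark. The encoder is replaced here by toy hidden states.

```python
from typing import List, Sequence, Tuple

# Hypothetical id for the landmark token; a real vocabulary would reserve one.
LMK_ID = -1


def insert_landmarks(token_ids: Sequence[int], chunk_size: int,
                     lmk_id: int = LMK_ID) -> Tuple[List[int], List[int]]:
    """Split token_ids into fixed-size chunks and append a landmark after each.

    Returns the augmented sequence and the landmark positions within it.
    Assumption: a landmark also follows the final (possibly shorter) chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    augmented: List[int] = []
    landmark_positions: List[int] = []
    for start in range(0, len(token_ids), chunk_size):
        augmented.extend(token_ids[start:start + chunk_size])
        landmark_positions.append(len(augmented))  # index the landmark will occupy
        augmented.append(lmk_id)
    return augmented, landmark_positions


def landmark_pool(hidden_states: Sequence[Sequence[float]],
                  landmark_positions: Sequence[int]) -> List[float]:
    """Mean-pool the per-token vectors at the landmark positions only."""
    if not landmark_positions:
        raise ValueError("no landmark tokens to pool over")
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for pos in landmark_positions:
        for d in range(dim):
            pooled[d] += hidden_states[pos][d]
    return [value / len(landmark_positions) for value in pooled]


# Toy usage: five tokens, chunks of two, so three landmarks are inserted.
ids, positions = insert_landmarks([11, 12, 13, 14, 15], chunk_size=2)
# Stand-in for encoder outputs: one 2-dimensional vector per augmented token.
toy_hidden = [[float(i), 1.0] for i in range(len(ids))]
embedding = landmark_pool(toy_hidden, positions)
```

In a real encoder the landmark vectors would be contextualised by attention over the surrounding tokens, which is what lets a handful of landmarks summarise the whole sequence. The toy hidden states above only exercise the indexing and averaging.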
