LMK > CLS: Landmark Pooling for Dense Embeddings

Abstract

Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence into a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
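The chunk, insert, and pool steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `insert_landmarks` and `lmk_pool`, the placeholder token id `LMK_ID`, the fixed `chunk_size`, and the choice to place a landmark after every chunk (including the last, so that a single-chunk input still has one landmark) are all assumptions made for illustration. The hidden states would normally come from an encoder; here they are plain lists of floats.

```python
from typing import List, Sequence, Tuple

# Hypothetical landmark token id; a real tokenizer would reserve a special token.
LMK_ID = -1


def insert_landmarks(token_ids: Sequence[int], chunk_size: int,
                     lmk_id: int = LMK_ID) -> Tuple[List[int], List[int]]:
    """Split token_ids into chunks of chunk_size and add a landmark after each chunk.

    Returns the augmented sequence and the positions of the landmarks in it.
    Assumption: a landmark also follows the final (possibly shorter) chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    augmented: List[int] = []
    positions: List[int] = []
    for start in range(0, len(token_ids), chunk_size):
        augmented.extend(token_ids[start:start + chunk_size])
        positions.append(len(augmented))  # index the landmark is about to occupy
        augmented.append(lmk_id)
    return augmented, positions


def lmk_pool(hidden_states: Sequence[Sequence[float]],
             positions: Sequence[int]) -> List[float]:
    """Mean-pool the hidden states located at the landmark positions."""
    if not positions:
        raise ValueError("no landmark positions to pool")
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for pos in positions:
        for d in range(dim):
            pooled[d] += hidden_states[pos][d]
    return [value / len(positions) for value in pooled]


if __name__ == "__main__":
    ids, pos = insert_landmarks([10, 11, 12, 13, 14], chunk_size=2)
    # Stand-in for encoder outputs: one 2-dimensional vector per position.
    hidden = [[float(i), 2.0 * i] for i in range(len(ids))]
    print(ids, pos, lmk_pool(hidden, pos))
```

In this sketch the number of landmarks grows with sequence length (one per chunk), so the pooled vector averages over evenly spaced summary positions rather than relying on a single leading token or on every token equally.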
