SeqPE: Transformer with Sequential Position Encoding

Abstract

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, the fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation beyond the sequence lengths seen during pre-training. Expert-designed methods such as ALiBi and RoPE mitigate this limitation but demand extensive modifications to adapt to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each n-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn the corresponding embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, particularly under context length extrapolation, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.
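The idea of writing an n-dimensional position index as a symbolic sequence can be illustrated with a small sketch. This is not the released implementation: the function names `position_to_symbols` and `position_distance`, the fixed-width zero-padded decimal layout, and the choice of Euclidean distance as the predefined position-distance function are all assumptions made here for illustration. The learned sequential position encoder that would consume these symbols is not shown.

```python
from math import dist


def position_to_symbols(position, width=3, base=10):
    """Spell an n-dimensional position index as a flat list of digit symbols.

    Assumption (not from the paper): each coordinate is written as `width`
    zero-padded base-`base` digits, most significant digit first, and the
    coordinates are concatenated in order.
    """
    symbols = []
    for coord in position:
        if not 0 <= coord < base ** width:
            raise ValueError(f"coordinate {coord} does not fit in {width} digits")
        digits = []
        for _ in range(width):
            coord, digit = divmod(coord, base)
            digits.append(digit)  # least significant digit first
        symbols.extend(reversed(digits))  # store most significant digit first
    return symbols


def position_distance(p, q):
    """Hypothetical predefined position-distance function: Euclidean distance."""
    return dist(p, q)


print(position_to_symbols((42,)))         # 1D token position -> [0, 4, 2]
print(position_to_symbols((3, 17)))       # 2D patch position -> [0, 0, 3, 0, 1, 7]
print(position_distance((0, 0), (3, 4)))  # -> 5.0
```

Under these assumptions, the same symbol vocabulary covers 1D text positions and 2D image positions, and a position larger than any seen in training is still spelled with familiar digits, which is the property the abstract relies on for extrapolation and for multi-dimensional inputs.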
