SeqPE: Transformer with Sequential Position Encoding
June 16, 2025
Authors: Huyang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe
cs.AI
Abstract
Since self-attention layers in Transformers are permutation invariant by
design, positional encodings must be explicitly incorporated to enable spatial
understanding. However, fixed-size lookup tables used in traditional learnable
position embeddings (PEs) limit extrapolation capabilities beyond pre-trained
sequence lengths. Expert-designed methods such as ALiBi and RoPE, mitigate this
limitation but demand extensive modifications for adapting to new modalities,
underscoring fundamental challenges in adaptability and scalability. In this
work, we present SeqPE, a unified and fully learnable position encoding
framework that represents each n-dimensional position index as a symbolic
sequence and employs a lightweight sequential position encoder to learn its
embedding in an end-to-end manner. To regularize SeqPE's embedding space, we
introduce two complementary objectives: a contrastive objective that aligns
embedding distances with a predefined position-distance function, and a
knowledge distillation loss that anchors out-of-distribution position
embeddings to in-distribution teacher representations, further enhancing
extrapolation performance. Experiments across language modeling, long-context
question answering, and 2D image classification demonstrate that SeqPE not only
surpasses strong baselines in perplexity, exact match (EM), and
accuracy, particularly under context-length extrapolation, but also enables
seamless generalization to multi-dimensional inputs without requiring manual
architectural redesign. We release our code, data, and checkpoints at
https://github.com/ghrua/seqpe.
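
To make the core idea concrete, here is a minimal, hypothetical sketch (not the authors' released code at the repository above) of a sequential position encoder in the spirit the abstract describes: each position index is spelled out as a digit-symbol sequence and encoded by a small Transformer, so arbitrary and multi-dimensional indices share one encoder. All class names, the vocabulary, pooling, and sizes are illustrative assumptions.

```python
# Minimal sketch, assuming a digit-level vocabulary and mean pooling;
# not the paper's exact architecture.
import torch
import torch.nn as nn

class SeqPESketch(nn.Module):
    def __init__(self, d_model=64, n_layers=2, n_heads=4, max_len=16):
        super().__init__()
        # Symbol vocabulary: digits 0-9 plus a separator for multi-dim indices.
        self.vocab = {str(d): d for d in range(10)}
        self.vocab[","] = 10
        self.sym_emb = nn.Embedding(len(self.vocab), d_model)
        # The digit sequence itself needs an order signal; without it,
        # "12" and "21" would pool to the same embedding.
        self.digit_pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def tokenize(self, position):
        # e.g. (3, 12) -> "3,12" -> [3, 10, 1, 2]
        text = ",".join(str(p) for p in position)
        return torch.tensor([[self.vocab[c] for c in text]])

    def forward(self, position):
        tokens = self.tokenize(position)                   # (1, L)
        x = self.sym_emb(tokens) + self.digit_pos(torch.arange(tokens.size(1)))
        hidden = self.encoder(x)                           # (1, L, d_model)
        return hidden.mean(dim=1).squeeze(0)               # one vector per index

pe = SeqPESketch()
print(pe((3, 12)).shape)  # torch.Size([64]): the same encoder handles 1D or 2D indices
```

Because the encoder reads symbol sequences rather than looking up a fixed table, it can produce embeddings for indices never seen in pre-training, which is the property the extrapolation results rely on.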
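The abstract also mentions a contrastive objective that aligns embedding distances with a predefined position-distance function. Below is one plausible InfoNCE-style instantiation, reusing the SeqPESketch model from the previous sketch; the cosine similarity, temperature, and absolute-difference distance are assumptions, and the paper's exact formulation may differ.

```python
# Hedged sketch of the contrastive regularizer: positions that are close
# under a predefined distance (here |i - j|) should have similar embeddings.
# The InfoNCE form and temperature are assumptions, not the paper's loss.
import torch
import torch.nn.functional as F

def contrastive_position_loss(pe_model, anchor, candidates, tau=0.1):
    """anchor: int position; candidates: list of int positions.
    The candidate nearest to the anchor is treated as the positive."""
    z_a = pe_model((anchor,))
    z_c = torch.stack([pe_model((c,)) for c in candidates])
    sims = F.cosine_similarity(z_a.unsqueeze(0), z_c, dim=-1) / tau
    # Predefined position-distance function: absolute index difference.
    dists = torch.tensor([abs(anchor - c) for c in candidates], dtype=torch.float)
    positive = dists.argmin()
    return F.cross_entropy(sims.unsqueeze(0), positive.unsqueeze(0))

loss = contrastive_position_loss(pe, anchor=5, candidates=[6, 50, 300])
loss.backward()  # gradients flow back into the sequential position encoder
```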