Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas

February 1, 2026
Authors: Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis
cs.AI

Abstract

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as images, point clouds, videos, or event camera streams, our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but account only partially for vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets spanning 4 modalities and find that either PaPE or PaPE-RI achieves the top performance on 7 of the 8. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
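The abstract names the design principles but not the exact formulation, so the sketch below is only an illustration of what a parabola-shaped attention bias satisfying two of those principles (translation invariance and distance decay) could look like. The function name `parabolic_bias`, the squared-Euclidean form, and the `scale` parameter are assumptions made for illustration, not the authors' method; the actual implementation is in the linked repository.

```python
# Illustrative sketch, NOT the authors' formulation: an additive attention
# bias that is a downward parabola in the distance between token positions.
import torch

def parabolic_bias(coords: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    """Toy parabolic bias over nD token positions.

    coords: (n, d) tensor of token positions (e.g., 2D patch centers).
    Returns an (n, n) bias added to attention logits before softmax.
    """
    # Pairwise differences depend only on relative position,
    # so the bias is translation invariant.
    diff = coords[:, None, :] - coords[None, :, :]   # (n, n, d)
    sq_dist = (diff ** 2).sum(dim=-1)                # squared Euclidean distance
    # Negative quadratic: logits decrease with distance (distance decay).
    return -scale * sq_dist

# Usage on a 4x4 grid of image-patch positions (16 tokens).
ys, xs = torch.meshgrid(torch.arange(4.0), torch.arange(4.0), indexing="ij")
coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (16, 2)
logits = torch.randn(16, 16) + parabolic_bias(coords)
attn = logits.softmax(dim=-1)
```

Note that this toy bias is isotropic, so it exhibits neither the directionality nor the context-awareness principles the abstract lists; PaPE's actual parabolic construction covers those as well.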