**Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas**

Abstract

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (for example from images, point clouds, videos, or event camera streams), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, while only partially accounting for vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities and find that either PaPE or PaPE-RI achieves the top performance on 7 of the 8. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving accuracy in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
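
To make the design principles named in the abstract more concrete, the following is a minimal sketch of a toy attention bias, not the PaPE formulation itself, which the abstract does not specify. It assumes 2D token positions and a bias added to attention logits; the function name `toy_parabolic_bias` and the parameters `curvature` and `direction` are hypothetical and chosen only for illustration.

```python
import math


def toy_parabolic_bias(p_query, p_key, curvature=0.5, direction=(0.0, 0.0)):
    """Toy attention-logit bias from a 2D relative offset (illustrative, not PaPE)."""
    # The bias sees only the offset d = p_key - p_query, never absolute
    # positions, so shifting both tokens by the same vector changes nothing.
    d = [k - q for q, k in zip(p_query, p_key)]
    # Downward-opening parabola in the offset: farther keys score lower.
    decay = -curvature * sum(x * x for x in d)
    # Linear tilt: offsets aligned with `direction` score higher than opposite ones.
    tilt = sum(w * x for w, x in zip(direction, d))
    return decay + tilt


def rotate(p, angle):
    """Rotate a 2D point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def shift(p, t):
    """Translate a 2D point by the vector `t`."""
    return (p[0] + t[0], p[1] + t[1])


query, key = (1.0, 2.0), (4.0, 6.0)
right = (0.1, 0.0)  # hypothetical preferred direction, for illustration only

# Translation invariance: both calls see the same relative offset.
print(toy_parabolic_bias(query, key, direction=right))
print(toy_parabolic_bias(shift(query, (10.0, -3.0)), shift(key, (10.0, -3.0)), direction=right))

# Distance decay: a nearby key receives a larger bias than a distant one.
print(toy_parabolic_bias(query, (2.0, 2.0)) > toy_parabolic_bias(query, (5.0, 2.0)))
```

With `direction` left at zero, this toy bias depends only on the distance between the two tokens, so it is also unchanged when both positions are rotated together, which is the property the abstract associates with PaPE-RI. Context awareness is not modeled here; one would presumably need the parameters to depend on token content, which is beyond this sketch.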
