REOrdering Patches Improves Vision Models

Abstract

Sequence models such as transformers require their inputs to be represented as one-dimensional sequences. In vision, this typically means flattening an image into patches in a fixed row-major (raster-scan) order. Full self-attention is permutation-equivariant, but modern long-sequence transformers increasingly rely on architectural approximations that break this equivariance and make the model sensitive to patch ordering. We show that patch order significantly affects model performance in such settings: simple alternatives such as column-major order or Hilbert curves yield notable shifts in accuracy. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy with REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering by up to 3.01% on ImageNet-1K and by 13.35% on Functional Map of the World.
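
The ingredients named in the abstract can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. The function names, the use of zlib as a compressibility proxy, the learning rate, and the placeholder reward are all assumptions made for illustration; the abstract does not specify the compressor, the reward signal, or how the prior is combined with the learned policy.

```python
import math
import random
import zlib


def row_major(h, w):
    """Patch indices of an h x w patch grid in raster-scan order."""
    return [r * w + c for r in range(h) for c in range(w)]


def column_major(h, w):
    """The same patch indices, visited column by column."""
    return [r * w + c for c in range(w) for r in range(h)]


def compressed_size(patches, order):
    """Compressed length in bytes of the patches concatenated in `order`.

    Assumption: zlib stands in for whatever compressor the paper uses.
    `patches` is a list of bytes objects, one per patch.
    """
    return len(zlib.compress(b"".join(patches[i] for i in order)))


def sample_plackett_luce(logits, rng):
    """Draw a permutation by repeatedly picking one remaining item
    with probability proportional to exp(logit)."""
    remaining = list(range(len(logits)))
    perm = []
    while remaining:
        m = max(logits[j] for j in remaining)  # subtract max for stability
        weights = [math.exp(logits[j] - m) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        perm.append(pick)
        remaining.remove(pick)
    return perm


def plackett_luce_log_prob(logits, perm):
    """Log-probability of `perm` under a Plackett-Luce model."""
    total = 0.0
    for k, item in enumerate(perm):
        rest = [logits[j] for j in perm[k:]]
        m = max(rest)
        log_z = m + math.log(sum(math.exp(x - m) for x in rest))
        total += logits[item] - log_z
    return total


def grad_log_prob(logits, perm):
    """Gradient of the log-probability of `perm` with respect to the logits."""
    grad = [0.0] * len(logits)
    for k, item in enumerate(perm):
        rest = perm[k:]
        m = max(logits[j] for j in rest)
        exps = [math.exp(logits[j] - m) for j in rest]
        z = sum(exps)
        grad[item] += 1.0  # the item chosen at step k
        for j, e in zip(rest, exps):
            grad[j] -= e / z  # softmax over the items still available
    return grad


def reinforce_step(logits, reward_fn, rng, lr=0.1, baseline=0.0):
    """One REINFORCE update: sample an ordering, score it, and move the
    logits along (reward - baseline) * grad log p(ordering).

    `reward_fn` is a hypothetical callable mapping a permutation to a scalar.
    """
    perm = sample_plackett_luce(logits, rng)
    advantage = reward_fn(perm) - baseline
    grad = grad_log_prob(logits, perm)
    return [l + lr * advantage * g for l, g in zip(logits, grad)], perm
```

In a training loop, one would presumably call reinforce_step once per batch with a reward derived from the vision model's task performance on the sampled ordering, and could use compressed_size to compare candidate starting orderings such as row_major and column_major. Both of these uses are assumptions, not details given in the abstract.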
