REOrdering Patches Improves Vision Models
May 29, 2025
Authors: Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta
cs.AI
Abstract
Sequence models such as transformers require inputs to be represented as
one-dimensional sequences. In vision, this typically involves flattening images
using a fixed row-major (raster-scan) order. While full self-attention is
permutation-equivariant, modern long-sequence transformers increasingly rely on
architectural approximations that break this invariance and introduce
sensitivity to patch ordering. We show that patch order significantly affects
model performance in such settings, with simple alternatives like column-major
or Hilbert curves yielding notable accuracy shifts. Motivated by this, we
propose REOrder, a two-stage framework for discovering task-optimal patch
orderings. First, we derive an information-theoretic prior by evaluating the
compressibility of various patch sequences. Then, we learn a policy over
permutations by optimizing a Plackett-Luce policy using REINFORCE. This
approach enables efficient learning in a combinatorial permutation space.
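The Plackett-Luce parameterization described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the patch count, zero-initialized logits, and scalar reward are assumptions. It samples a permutation via the Gumbel-argsort trick (equivalent to sampling sequentially without replacement from the Plackett-Luce distribution) and forms the REINFORCE surrogate loss from the permutation's log-probability.

```python
# Minimal sketch of a Plackett-Luce permutation policy trained with REINFORCE.
# Hypothetical setup: 16 patches, one learnable logit per patch, scalar reward.
import torch

num_patches = 16
logits = torch.zeros(num_patches, requires_grad=True)  # one score per patch

def sample_permutation(logits):
    # Gumbel trick: argsort of (logits + Gumbel noise) is a Plackett-Luce sample
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    return torch.argsort(logits + gumbel, descending=True)

def log_prob(logits, perm):
    # Plackett-Luce log-likelihood: at each step, a softmax over the items
    # that remain; logcumsumexp over the reversed scores gives each step's
    # normalizer in one pass.
    scores = logits[perm]
    normalizers = torch.logcumsumexp(scores.flip(0), dim=0).flip(0)
    return (scores - normalizers).sum()

perm = sample_permutation(logits)
reward = 1.0  # e.g., task accuracy obtained with this patch ordering (assumed)
loss = -reward * log_prob(logits, perm)  # REINFORCE surrogate objective
loss.backward()  # gradient flows into the per-patch logits
```

Sampling by perturbed argsort avoids materializing the factorial-sized permutation space, which is what makes learning in this combinatorial space tractable.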
REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to
3.01% and Functional Map of the World by 13.35%.
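The alternative flattening orders the abstract compares can be illustrated concretely. The sketch below is not from the paper; the grid size is an assumption, and the Hilbert mapping is the standard distance-to-coordinate conversion. Each function returns the sequence of (row, column) patch positions visited by that ordering.

```python
# Three ways to flatten an n-by-n grid of image patches into a 1-D sequence.
# Illustrative only; a 4x4 grid is assumed (Hilbert requires n a power of two).

def row_major(n):
    # Raster scan: left to right, top to bottom
    return [(r, c) for r in range(n) for c in range(n)]

def column_major(n):
    # Top to bottom, then left to right
    return [(r, c) for c in range(n) for r in range(n)]

def hilbert(n):
    # Standard Hilbert-curve distance-to-(x, y) conversion; consecutive
    # points in the output are always grid-adjacent, preserving locality.
    def d2xy(n, d):
        x = y = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:  # rotate quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y
    return [d2xy(n, d) for d in range(n * n)]

# Apply an ordering to a grid of patch indices
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
hilbert_sequence = [grid[r][c] for r, c in hilbert(4)]
```

Unlike the raster scan, which jumps a full row width at each line break, the Hilbert ordering keeps every consecutive pair of patches spatially adjacent, which is one reason it behaves differently under locality-sensitive attention approximations.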