
REOrdering Patches Improves Vision Models

May 29, 2025
作者: Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta
cs.AI

Abstract
Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
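The abstract describes two components that can be sketched concretely: alternative patch orderings (e.g., column-major instead of raster-scan) and a Plackett-Luce policy over permutations trained with REINFORCE. The following is a minimal NumPy sketch of those ideas, not the authors' implementation; all function names are illustrative, and the sampler uses the standard Gumbel-noise trick for Plackett-Luce, which the paper does not necessarily use.

```python
import numpy as np

def column_major_order(rows, cols):
    """Index permutation that reads a rows x cols patch grid column-first
    instead of the default row-major (raster-scan) order."""
    return np.arange(rows * cols).reshape(rows, cols).T.ravel()

def sample_plackett_luce(scores, rng):
    """Sample a permutation from a Plackett-Luce distribution by sorting
    the scores perturbed with Gumbel noise (the 'Gumbel-top-k' trick)."""
    gumbel = rng.gumbel(size=scores.shape)
    return np.argsort(-(scores + gumbel))

def pl_log_prob(scores, perm):
    """log P(perm) = sum_i [ s_{perm_i} - logsumexp(s_{perm_i:}) ]."""
    s = scores[perm]
    suffix_lse = np.logaddexp.accumulate(s[::-1])[::-1]
    return float(np.sum(s - suffix_lse))

def pl_grad_log_prob(scores, perm):
    """Analytic gradient of log P(perm) w.r.t. the scores, for use in
    the REINFORCE (score-function) estimator."""
    grad = np.zeros_like(scores, dtype=float)
    remaining = list(perm)
    for item in perm:
        s = scores[remaining]
        p = np.exp(s - s.max())
        p /= p.sum()
        grad[remaining] -= p   # -softmax over the still-unplaced items
        grad[item] += 1.0      # +1 for the item chosen at this step
        remaining.remove(item)
    return grad

def reinforce_step(scores, reward_fn, rng, lr=0.1, baseline=0.0):
    """One REINFORCE update: nudge scores toward permutations that earn
    above-baseline reward (e.g., validation accuracy of the model)."""
    perm = sample_plackett_luce(scores, rng)
    grad = pl_grad_log_prob(scores, perm)
    return scores + lr * (reward_fn(perm) - baseline) * grad
```

With equal scores the policy is uniform over all n! orderings, so `pl_log_prob` of any permutation is -log(n!); as training pushes some scores up, the sampler increasingly favors orderings that place high-score patches early.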

