REOrdering Patches Improves Vision Models

Abstract

Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically means flattening an image into patches in a fixed row-major (raster-scan) order. Full self-attention is permutation-equivariant, but modern long-sequence transformers increasingly rely on architectural approximations that break this equivariance and make the model sensitive to patch ordering. We show that patch order significantly affects model performance in such settings: simple alternatives such as column-major order or Hilbert curves yield notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy with REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering by up to 3.01% on ImageNet-1K and by 13.35% on Functional Map of the World.
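To make the two ingredients named above concrete, the following is a minimal NumPy sketch of (a) fixed patch orderings expressed as index permutations and (b) sampling a permutation from a Plackett-Luce distribution together with its log-probability, which is the quantity a REINFORCE-style estimator would weight by reward. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the logits are a stand-in for learned per-patch scores, and neither the compressibility prior nor the policy update is shown.

```python
import numpy as np


def row_major_order(h, w):
    """Raster-scan order over an h x w patch grid (identity permutation)."""
    return np.arange(h * w)


def column_major_order(h, w):
    """Visit the same row-major patch indices column by column."""
    return np.arange(h * w).reshape(h, w).T.reshape(-1)


def sample_plackett_luce(logits, rng):
    """Draw one permutation from Plackett-Luce(logits).

    Uses the Gumbel-top-k trick: perturb each logit with i.i.d. Gumbel
    noise and sort in descending order.
    """
    noise = rng.gumbel(size=logits.shape)
    return np.argsort(-(logits + noise))


def plackett_luce_log_prob(logits, perm):
    """log P(perm) = sum_i [ s_i - logsumexp(s_i, ..., s_n) ], s = logits[perm]."""
    s = logits[perm]
    # Running logsumexp over the reversed scores gives suffix logsumexps.
    suffix_lse = np.logaddexp.accumulate(s[::-1])[::-1]
    return float(np.sum(s - suffix_lse))


# Hypothetical usage on a 2 x 3 grid of patches with 4-dim features.
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 4))      # placeholder patch embeddings
logits = np.zeros(6)                   # uniform policy as a stand-in
perm = sample_plackett_luce(logits, rng)
reordered = patches[perm]              # sequence fed to the model
log_p = plackett_luce_log_prob(logits, perm)
```

With all-zero logits the policy is uniform over permutations, so `log_p` equals `-log(6!)` regardless of which permutation is sampled; non-uniform logits would concentrate probability on orderings that place high-scoring patches first.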
