Equivariant Image Modeling

Abstract

Current generative models, such as autoregressive and diffusion approaches, decompose the learning of high-dimensional data distributions into a series of simpler subtasks. However, inherent conflicts arise when these subtasks are optimized jointly, and existing solutions fail to resolve them without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization, which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention, which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art autoregressive (AR) models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.
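To make the two ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a windowed causal mask lets each token attend to itself and a fixed number of preceding tokens, and it uses plain column slicing as a toy stand-in for column-wise tokenization (the actual tokenizer is presumably learned). The names `windowed_causal_mask`, `visible_offsets`, `split_into_columns`, and the `window` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def windowed_causal_mask(num_tokens: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumption: token i sees itself and at most `window - 1` preceding tokens.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [0 <= i - j < window for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def visible_offsets(mask: Sequence[Sequence[bool]], i: int) -> List[int]:
    """Relative offsets (j - i) of the keys that query token i can attend to."""
    return [j - i for j, allowed in enumerate(mask[i]) if allowed]


def split_into_columns(image: Sequence[Sequence[int]]) -> List[List[int]]:
    """Toy stand-in for column-wise tokenization: one 'token' per pixel column.

    `image` is a list of rows; the result is a list of columns, left to right.
    """
    return [list(column) for column in zip(*image)]


if __name__ == "__main__":
    mask = windowed_causal_mask(num_tokens=6, window=3)
    # Once the window is full, every position sees the same relative context.
    print(visible_offsets(mask, 2))
    print(visible_offsets(mask, 5))
    print(split_into_columns([[1, 2, 3], [4, 5, 6]]))
```

Under these assumptions, every position past the first `window - 1` tokens sees the same set of relative offsets, which is one way to read the abstract's "consistent contextual relationships across positions". Because the context never grows with absolute position, the same pattern applies unchanged to sequences longer than those seen in training.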
