

Equivariant Image Modeling

March 24, 2025
Authors: Ruixiao Dong, Mengde Xu, Zigang Geng, Li Li, Han Hu, Shuyang Gu
cs.AI

Abstract

Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.
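The windowed causal attention described in the abstract can be illustrated with a mask in which every position attends only to a fixed-width causal window of preceding tokens, so each prediction subtask sees the same relative context regardless of position. The sketch below is illustrative only; the function name and `window` parameter are assumptions, not the paper's actual implementation.

```python
import numpy as np

def windowed_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean attention mask (True = attention allowed).

    Position i may attend to positions j satisfying i - window < j <= i,
    i.e. itself and the (window - 1) tokens immediately before it. Because
    the allowed context is defined purely by relative offset, the mask is
    the same at every position, matching the translation-equivariant
    design the abstract describes (hypothetical sketch).
    """
    i = np.arange(seq_len)[:, None]  # query positions, shape (seq_len, 1)
    j = np.arange(seq_len)[None, :]  # key positions, shape (1, seq_len)
    return (j <= i) & (j > i - window)
```

Such a mask would typically be passed to an attention layer in place of the standard lower-triangular causal mask, trading unbounded context for position-consistent subtasks.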

