

What matters for Representation Alignment: Global Information or Spatial Structure?

December 11, 2025
作者: Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie
cs.AI

Abstract

Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder into intermediate diffusion features. We investigate a fundamental question: which aspect of the target representation matters for generation, its global semantic information (e.g., as measured by ImageNet-1K accuracy) or its spatial structure (i.e., the pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance makes for a better target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising: spatial structure, rather than global performance, drives the generation performance of a target representation. To investigate further, we introduce two straightforward modifications that specifically accentuate the transfer of spatial information: we replace the standard MLP projection layer in REPA with a simple convolution layer, and we introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves the convergence speed of REPA across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, and JiT). Our work motivates revisiting the fundamental working mechanism of representation alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa.
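The "spatial structure" the abstract refers to can be made concrete. A minimal sketch (not the authors' code; the function name and shapes are assumptions) of the pairwise patch-token cosine-similarity matrix:

```python
import numpy as np

def spatial_structure(tokens):
    """Pairwise cosine similarity between patch tokens.

    tokens: (N, D) array of N patch embeddings from a vision encoder.
    Returns the (N, N) similarity matrix that the abstract calls the
    representation's "spatial structure".
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return normed @ normed.T

# Tiny example: 4 patch tokens with 8-dimensional features.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
S = spatial_structure(X)  # (4, 4), symmetric, ones on the diagonal
```

Comparing this matrix across encoders, rather than their ImageNet-1K accuracy, is the kind of analysis the paper argues is predictive of generation quality.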