RepFusion: Denoising in Representation Space with Multimodal Priors

Abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.
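The conditioning loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the noise schedule, the sampler, or any interfaces, so the sketch assumes a linear (rectified-flow style) noise path with Euler steps, and every name in it (`sample`, `toy_mllm`, `toy_dit`) is hypothetical. The only point it shows is that the MLLM is re-run on the current noisy representation at every step, so its output changes as the sample evolves.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample(
    mllm: Callable[[Vector, Vector, float], Vector],
    dit: Callable[[Vector, Vector, float], Vector],
    text: Vector,
    dim: int,
    steps: int = 8,
    seed: int = 0,
) -> Vector:
    """Euler sampler that recomputes the MLLM conditioning at every step.

    Assumes a linear noise path z_t = (1 - t) * x + t * eps (an assumption,
    not stated in the abstract), integrated from t = 1 down to t = 0.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # pure noise at t = 1
    for i in range(steps):
        t = 1.0 - i / steps        # current noise level, always > 0
        cond = mllm(text, z, t)    # MLLM encodes prompt plus the noisy z
        v = dit(z, cond, t)        # denoiser predicts velocity dz/dt
        z = [zi - vi / steps for zi, vi in zip(z, v)]  # Euler step of 1/steps
    return z


def toy_mllm(text: Vector, z: Vector, t: float) -> Vector:
    """Hypothetical stand-in: returns the prompt vector as a clean-target guess.

    A real MLLM would project z into its token space and return hidden states.
    """
    return list(text)


def toy_dit(z: Vector, cond: Vector, t: float) -> Vector:
    """Hypothetical stand-in: exact velocity if cond were the clean target.

    Under the linear path, dz/dt = eps - x = (z_t - x) / t.
    """
    return [(zi - ci) / t for zi, ci in zip(z, cond)]


if __name__ == "__main__":
    print(sample(toy_mllm, toy_dit, text=[0.5, -1.0, 2.0], dim=3))
```

In this sketch the MLLM is called once per sampling step, so the number of steps directly sets how much test-time compute goes to MLLM conditioning. A text-encoder-only pipeline would instead compute `cond` once, before the loop.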