RepFusion: Denoising in Representation Space with Multimodal Priors

Abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.
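
The conditioning loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the tensor sizes, the single linear layer standing in for the MLP projector, the one-layer placeholders for the MLLM and the diffusion transformer, and the Euler-style update are all hypothetical. The only point it shows is that the MLLM is re-run on the current noisy representation at every step, so its output changes as sampling proceeds.

```python
import numpy as np

# Toy sizes (hypothetical): representation dim, LLM hidden dim, text tokens, image tokens.
D_REP, D_LLM, N_TXT, N_IMG = 16, 32, 4, 8

_w = np.random.default_rng(0)
# Random placeholders for components that would be pretrained or trained in practice.
W_PROJ = _w.normal(scale=0.1, size=(D_REP, D_LLM))         # stand-in for the MLP projector
W_MLLM = _w.normal(scale=0.1, size=(D_LLM, D_LLM))         # stand-in for the MLLM
W_DIT = _w.normal(scale=0.1, size=(D_REP + D_LLM, D_REP))  # stand-in for the diffusion transformer


def mllm_encode(text_emb, noisy_rep):
    """Project noisy representation tokens into the LLM space and encode them with the text."""
    tokens = np.concatenate([text_emb, noisy_rep @ W_PROJ], axis=0)  # (N_TXT + N_IMG, D_LLM)
    hidden = np.tanh(tokens @ W_MLLM)
    return hidden[N_TXT:]  # hidden states at the image-token positions: (N_IMG, D_LLM)


def denoiser(noisy_rep, cond):
    """Toy conditional denoiser; it ignores the timestep for brevity."""
    return np.concatenate([noisy_rep, cond], axis=1) @ W_DIT  # (N_IMG, D_REP)


def sample(text_emb, num_steps=10, seed=0):
    """Run the toy sampler and return the final representation and the MLLM call count."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(N_IMG, D_REP))  # start from noise in representation space
    dt = 1.0 / num_steps
    mllm_calls = 0
    for _ in range(num_steps):
        cond = mllm_encode(text_emb, x)  # condition on the current noisy state
        mllm_calls += 1
        x = x + dt * denoiser(x, cond)   # Euler-style update (an assumption)
    return x, mllm_calls


if __name__ == "__main__":
    text = np.random.default_rng(1).normal(size=(N_TXT, D_LLM))
    rep, calls = sample(text, num_steps=10)
    print(rep.shape, calls)
```

In a setup where the language model only encodes the prompt, `mllm_encode` would be called once before the loop; here it is called once per step, which is where the additional test-time compute is spent.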