RepFusion: Denoising in Representation Space with Multimodal Priors

Abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.
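The conditioning loop described above can be illustrated with a toy sketch. It is not the authors' implementation: `project`, `mllm_encode`, and `dit_velocity` are hypothetical stand-ins for the MLP projector, the pretrained MLLM, and the diffusion transformer, and the Euler-style update is an assumed sampler, since the abstract does not specify one. The only point it shows is that the MLLM is re-run on the current noisy representation at every denoising step, so the conditioning signal evolves during sampling.

```python
import random
from typing import List, Tuple

Vector = List[float]


def project(z: Vector, weight: float = 0.5) -> Vector:
    """Hypothetical stand-in for the MLP projector (here a plain scaling)."""
    return [weight * v for v in z]


def mllm_encode(prompt: str, projected: Vector) -> Vector:
    """Hypothetical stand-in for the pretrained MLLM.

    A real model would attend jointly over text tokens and projected visual
    tokens; this toy version just shifts the input by a prompt-dependent bias.
    """
    bias = (len(prompt) % 7) / 10.0
    return [v + bias for v in projected]


def dit_velocity(z: Vector, t: float, cond: Vector) -> Vector:
    """Hypothetical stand-in for the diffusion transformer.

    Returns an update direction that pulls z toward the conditioning signal.
    The time t is accepted for interface parity but unused in this toy.
    """
    return [c - v for v, c in zip(z, cond)]


def sample(prompt: str, dim: int = 4, steps: int = 8, seed: int = 0) -> Tuple[Vector, int]:
    """Toy sampler: re-encode the noisy representation with the MLLM each step."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    mllm_calls = 0
    for i in range(steps):
        t = 1.0 - i / steps
        # Conditioning is recomputed from the *current* noisy representation.
        cond = mllm_encode(prompt, project(z))
        mllm_calls += 1
        velocity = dit_velocity(z, t, cond)
        # Assumed Euler-style step with step size 1 / steps.
        z = [zi + vi / steps for zi, vi in zip(z, velocity)]
    return z, mllm_calls


if __name__ == "__main__":
    rep, calls = sample("a red cube on a table")
    print(f"MLLM forward passes: {calls}, representation dim: {len(rep)}")
```

In this sketch the number of MLLM forward passes equals the number of denoising steps, which is the sense in which test-time compute is spent on repeated MLLM conditioning rather than on a larger, newly initialized denoiser.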