RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

Abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.
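
As a rough illustration of what "repeated MLLM conditioning" on evolving noisy representations could look like at sampling time, the following minimal sketch re-encodes the current noisy representation at every denoising step and passes the result to the denoiser as conditioning. It is not the RepFusion implementation: the names `sample`, `toy_mllm_encode`, and `toy_denoiser`, the Euler update, the time convention (t = 0 is pure noise, t = 1 is clean), and the toy arithmetic inside the stand-in functions are all assumptions made for illustration only.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample(
    mllm_encode: Callable[[Vector, Vector, float], Vector],
    denoiser: Callable[[Vector, Vector, float], Vector],
    prompt: Vector,
    num_steps: int = 8,
    seed: int = 0,
) -> Vector:
    """Euler sampling loop that recomputes the conditioning at every step.

    Assumed convention (hypothetical): t runs from 0 (pure noise) to 1 (clean).
    """
    rng = random.Random(seed)
    # Start from Gaussian noise in the (toy) representation space.
    x = [rng.gauss(0.0, 1.0) for _ in prompt]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # The encoder sees the prompt AND the current noisy representation,
        # so the conditioning changes as x is progressively denoised.
        cond = mllm_encode(prompt, x, t)
        # The denoiser predicts an update direction given that conditioning.
        v = denoiser(x, cond, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_mllm_encode(prompt: Vector, noisy_rep: Vector, t: float) -> Vector:
    # Stand-in for a pretrained MLLM with a projector on the noisy input.
    # Toy rule: rely on the prompt at high noise, on the representation later.
    return [(1.0 - t) * p + t * r for p, r in zip(prompt, noisy_rep)]


def toy_denoiser(x: Vector, cond: Vector, t: float) -> Vector:
    # Stand-in for a diffusion transformer: move x toward the conditioning.
    return [c - xi for c, xi in zip(cond, x)]


# Usage: an 8-step run invokes the encoder 8 times, once per step.
prompt_embedding = [1.0, -2.0, 0.5]  # hypothetical text embedding
result = sample(toy_mllm_encode, toy_denoiser, prompt_embedding, num_steps=8)
print(result)
```

In a conventional pipeline the text encoder would be called once before the loop. Here the encoder call sits inside the loop, which is where the additional test-time compute is spent.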