RepFusion: Denoising in Representation Space with Multimodal Priors

Abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but their role is typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we extend this mechanism from clean to noisy inputs and repurpose the MLLM itself as an encoder of noisy representations. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on the evolving noisy representation, modern T2I systems can productively spend test-time compute on repeated MLLM conditioning.
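The sampling procedure implied by the abstract can be illustrated with a minimal sketch. The abstract does not specify the sampler, the time convention, or any interfaces, so everything below is an assumption made for illustration: a flow-matching-style Euler sampler running from t = 0 (noise) to t = 1 (data), the hypothetical function name `sample_with_repeated_conditioning`, and the `project`, `mllm`, and `denoiser` callables with their toy stand-ins. What the sketch is meant to show is only the control flow: the MLLM is re-run on the current noisy representation at every step, so the conditioning signal evolves with the sample.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_with_repeated_conditioning(
    text_tokens: Vector,
    project: Callable[[Vector], Vector],
    mllm: Callable[[Vector, Vector], Vector],
    denoiser: Callable[[Vector, float, Vector], Vector],
    dim: int,
    num_steps: int = 8,
    seed: int = 0,
) -> Vector:
    """Euler sampler that recomputes the MLLM conditioning at every step.

    Assumed convention (not taken from the paper): t = 0 is pure noise,
    t = 1 is the clean representation, and the denoiser predicts a velocity.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        visual_tokens = project(x)               # stand-in for the MLP projector
        cond = mllm(text_tokens, visual_tokens)  # conditioning depends on the current x
        velocity = denoiser(x, t, cond)          # stand-in for the diffusion transformer
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


# Toy stand-ins (hypothetical, for illustration only).
def toy_project(x: Vector) -> Vector:
    return list(x)  # identity in place of a learned projector


def toy_mllm(text_tokens: Vector, visual_tokens: Vector) -> Vector:
    # A real MLLM would attend over both inputs; this toy ignores the visual tokens.
    return list(text_tokens)


def toy_denoiser(x: Vector, t: float, cond: Vector) -> Vector:
    # Straight-line velocity toward the conditioning vector.
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


result = sample_with_repeated_conditioning(
    [1.0, -2.0, 0.5], toy_project, toy_mllm, toy_denoiser, dim=3
)
```

In this toy setting the sample converges to the conditioning vector, which only confirms that the loop is wired correctly. The cost structure is the relevant point: the MLLM is invoked `num_steps` times per sample rather than once, which is the test-time compute trade-off the abstract refers to.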