Representation Forcing for Bottleneck-Free Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.
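
The generation order described in the abstract (representation tokens first, then pixel diffusion conditioned on them within one backbone) can be illustrated with a toy sketch. This is not the authors' implementation: `backbone`, `generate`, `DIM`, the averaging rule, and the denoising update are all hypothetical stand-ins chosen only to show the control flow, with no learned weights.

```python
import random

DIM = 4  # toy vector width shared by all token types (assumption)


def backbone(context, query=None):
    """Hypothetical stand-in for the shared backbone.

    Averages the context vectors; if a query (a noisy pixel patch) is
    given, blends it halfway toward that average.
    """
    mean = [sum(tok[i] for tok in context) / len(context) for i in range(DIM)]
    if query is None:
        return mean
    return [0.5 * m + 0.5 * q for m, q in zip(mean, query)]


def generate(prompt_tokens, num_rep_tokens=3, num_patches=2,
             diffusion_steps=4, seed=0):
    rng = random.Random(seed)
    context = [list(tok) for tok in prompt_tokens]

    # Stage 1: predict representation tokens one at a time,
    # each conditioned on everything generated so far.
    rep_tokens = []
    for _ in range(num_rep_tokens):
        next_token = backbone(context)
        rep_tokens.append(next_token)
        context.append(next_token)  # the token stays in context

    # Stage 2: pixel "diffusion" in the same backbone. Patches start as
    # Gaussian noise and are refined while attending to the full context,
    # which now includes the representation tokens.
    patches = [[rng.gauss(0.0, 1.0) for _ in range(DIM)]
               for _ in range(num_patches)]
    for _ in range(diffusion_steps):
        patches = [backbone(context, query=p) for p in patches]

    return rep_tokens, patches, context


if __name__ == "__main__":
    prompt = [[1.0] * DIM, [3.0] * DIM]
    rep, pixels, ctx = generate(prompt)
    print(len(rep), len(pixels), len(ctx))
```

The point of the sketch is only the ordering: no external latent encoder or decoder appears anywhere, and the pixel stage reads the same context that the representation stage wrote to.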