Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

Abstract

Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task but also as a direct supervisory signal that steers generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
