Semantic Generative Tuning for Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) aim to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms optimize the two independently: understanding via sparse text signals and generation via dense pixel objectives. This decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, in which we formulate hierarchical visual tasks as generative proxies to bridge this isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Whereas low-level tasks distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building on these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves the linear separability of features and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
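The linear separability of features mentioned above is commonly measured with a linear probe: a linear classifier is fit on frozen features and its held-out accuracy is taken as the separability score. The following is a minimal sketch of that general technique, not the authors' analysis code. It assumes NumPy, uses synthetic features in place of real UMM activations, and every name in it (`linear_probe_accuracy`, `synthetic_features`, `probe`, the `separation` parameter) is hypothetical.

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularized linear classifier on frozen features; return test accuracy."""

    def with_bias(x):
        # Append a constant column so the probe is affine rather than strictly linear.
        return np.hstack([x, np.ones((x.shape[0], 1))])

    x_train, x_test = with_bias(train_x), with_bias(test_x)
    one_hot = np.eye(num_classes)[train_y]
    # Closed-form ridge regression onto one-hot targets: W = (X^T X + ridge * I)^-1 X^T Y.
    gram = x_train.T @ x_train + ridge * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ one_hot)
    predictions = (x_test @ weights).argmax(axis=1)
    return float((predictions == test_y).mean())


def synthetic_features(rng, separation, num_samples=600, dim=8):
    """Assumed stand-in for model features: two Gaussian classes with means at +/- separation."""
    labels = rng.integers(0, 2, size=num_samples)
    signs = np.where(labels == 0, 1.0, -1.0)[:, None]
    features = separation * signs + rng.normal(size=(num_samples, dim))
    return features, labels


def probe(separation, seed=0, num_train=400):
    """Generate synthetic features, split them into train and test, and run the probe."""
    rng = np.random.default_rng(seed)
    features, labels = synthetic_features(rng, separation)
    return linear_probe_accuracy(
        features[:num_train], labels[:num_train],
        features[num_train:], labels[num_train:],
        num_classes=2,
    )


# Well-separated features should probe far better than features that carry no class signal.
acc_separated = probe(separation=1.5)
acc_entangled = probe(separation=0.0)
print(f"separated: {acc_separated:.2f}, entangled: {acc_entangled:.2f}")
```

In a real analysis, the synthetic arrays would be replaced by features extracted from the same layer of a model before and after tuning, with the probe configuration held fixed so that any change in accuracy reflects the features rather than the classifier.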