UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

Abstract

Unified multimodal models (UMMs) aim to integrate understanding and generation within a single architecture. However, how to coordinate these two capabilities for more effective and efficient reasoning remains underexplored. Existing coordination approaches either couple the two capabilities during training, without explicit inference-time coordination, or impose a fixed coordination pattern on all inputs. In this work, we show that multimodal tasks exhibit substantial coordination-path diversity: different inputs favor different coordination paths. This suggests that exploiting such diversity is key to improving performance. We propose UniPath, a framework for adaptively modeling and exploiting coordination-path diversity. Instead of enforcing a single coordination pattern, we represent task solving as the selection and execution of a path, ranging from direct answering to textual inference, visual-thought construction, and hypothesis-based exploration. We construct role-aligned trajectories to train a path-conditioned executor and introduce a lightweight planner mechanism to enable input-dependent path selection. Experiments show that leveraging coordination-path diversity improves performance over fixed coordination strategies while providing interpretable intermediate behaviors. The code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/unipath.
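To make the planner/executor split concrete, the following is a minimal illustrative sketch, not the authors' implementation. The four path names come from the abstract; everything else (the `toy_planner` keyword heuristic, the `ROLES` table, the `Trace` type, and the string-based "execution") is a hypothetical stand-in for the learned planner and the path-conditioned executor.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Path(Enum):
    # The four path families named in the abstract.
    DIRECT_ANSWER = "direct_answer"
    TEXTUAL_INFERENCE = "textual_inference"
    VISUAL_THOUGHT = "visual_thought"
    HYPOTHESIS_EXPLORATION = "hypothesis_exploration"


@dataclass
class Trace:
    path: Path
    steps: List[str]  # intermediate behaviors, kept for inspection


# Hypothetical role sequence per path (an assumption, not from the paper).
ROLES: Dict[Path, List[str]] = {
    Path.DIRECT_ANSWER: ["answer"],
    Path.TEXTUAL_INFERENCE: ["reason_in_text", "answer"],
    Path.VISUAL_THOUGHT: ["generate_image", "inspect_image", "answer"],
    Path.HYPOTHESIS_EXPLORATION: [
        "propose_hypotheses", "generate_image", "verify", "answer",
    ],
}


def toy_planner(question: str) -> Path:
    """Hypothetical keyword heuristic standing in for a learned planner."""
    q = question.lower()
    if "imagine" in q or "rotate" in q:
        return Path.VISUAL_THOUGHT
    if "which of" in q:
        return Path.HYPOTHESIS_EXPLORATION
    if "why" in q or "how many" in q:
        return Path.TEXTUAL_INFERENCE
    return Path.DIRECT_ANSWER


def execute(question: str, path: Path) -> Trace:
    """Stand-in executor: records which roles would run, in order."""
    return Trace(path, [f"{role}: {question}" for role in ROLES[path]])


def solve(question: str, planner: Callable[[str], Path] = toy_planner) -> Trace:
    # Input-dependent selection first, then path-conditioned execution.
    return execute(question, planner(question))
```

Passing a constant planner such as `lambda q: Path.TEXTUAL_INFERENCE` to `solve` mimics a fixed coordination strategy, which is the baseline the abstract contrasts against.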
