RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Abstract

The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.
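The diagnostic idea behind the dual-evaluation protocol can be illustrated with a minimal Python sketch. This is not the benchmark's actual scoring code: the `InstanceResult` record, the `diagnose` helper, and the `margin` threshold are all hypothetical names and values introduced here for illustration only. The sketch assumes each instance has already been judged correct or incorrect under both the direct end-to-end setting and the stepwise (decomposed) setting, and it simply compares the two accuracies.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class InstanceResult:
    """Judged outcome for one instance (hypothetical record, not from the paper)."""
    direct_correct: bool    # end-to-end attempt judged correct
    stepwise_correct: bool  # decomposed understand-then-generate attempt judged correct


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def diagnose(results: list[InstanceResult], margin: float = 0.05) -> dict:
    """Compare direct and stepwise accuracy.

    `margin` is an assumed, arbitrary threshold. If stepwise accuracy exceeds
    direct accuracy by more than `margin`, the core abilities appear to be
    present and the shortfall is attributed to integration. Otherwise the
    remaining errors more plausibly come from the core abilities themselves.
    """
    direct = accuracy(r.direct_correct for r in results)
    stepwise = accuracy(r.stepwise_correct for r in results)
    gap = stepwise - direct
    hint = "integration" if gap > margin else "core_ability"
    return {"direct": direct, "stepwise": stepwise, "gap": gap, "hint": hint}


if __name__ == "__main__":
    # Placeholder outcomes, made up purely to exercise the function.
    toy = [
        InstanceResult(direct_correct=False, stepwise_correct=True),
        InstanceResult(direct_correct=True, stepwise_correct=True),
        InstanceResult(direct_correct=False, stepwise_correct=False),
        InstanceResult(direct_correct=False, stepwise_correct=True),
    ]
    print(diagnose(toy))
```

A single threshold on the accuracy gap is a deliberately crude choice; a per-category breakdown or a paired significance test would be a natural refinement under the same assumptions.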
