

ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

November 3, 2025
Authors: Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang
cs.AI

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.
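To make the two complementary settings concrete, here is a minimal, hypothetical sketch of how a ROVER-style task record and evaluation loop could be represented. This is not the authors' released code; every field and function name below (e.g. RoverTask, evaluate, the setting labels) is an assumption for illustration only.

```python
# Hypothetical sketch of a ROVER-style task and evaluation loop.
# All names and fields are illustrative assumptions, not the paper's actual schema.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RoverTask:
    """One benchmark item grounded in one or more input images."""
    task_id: str
    setting: str                 # "verbal_for_visual" or "visual_for_verbal"
    prompt: str                  # verbal instruction or question
    image_paths: List[str] = field(default_factory=list)
    reference_answer: str = ""   # gold answer for QA-style tasks


def evaluate(task: RoverTask,
             model: Callable[[RoverTask], Dict],
             judge: Callable[[RoverTask, Dict], float]) -> float:
    """Run one task: the model returns interleaved outputs (text and/or images),
    and a judge (human or automatic) scores how well the cross-modal reasoning
    supported the final output."""
    output = model(task)  # e.g. {"text": ..., "generated_images": [...]}
    return judge(task, output)


if __name__ == "__main__":
    # Toy usage with stub model and judge, only to show the control flow.
    demo = RoverTask(
        task_id="demo-001",
        setting="visual_for_verbal",
        prompt="Sketch the described arrangement, then answer the question.",
        image_paths=["scene.png"],
        reference_answer="B",
    )
    stub_model = lambda t: {"text": "B", "generated_images": ["sketch.png"]}
    stub_judge = lambda t, out: float(out["text"].strip() == t.reference_answer)
    print(evaluate(demo, stub_model, stub_judge))  # -> 1.0
```

The point of the sketch is the separation of concerns: the "verbally-augmented" setting would score the generated images against the prompt and reasoning chain, while the "visually-augmented" setting would score the final verbal answer, with the intermediate visualizations serving the model's own reasoning.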