

ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

November 3, 2025
Authors: Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang
cs.AI

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, so that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning: textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. To address this gap, we introduce ROVER, which tests reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark of 1,312 tasks grounded in 1,876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, simply combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show a dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct the visual abstractions that symbolic tasks require, and faulty reasoning there harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.
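To make the two settings concrete, here is a minimal Python sketch of how a ROVER-style task schema and evaluation loop might be organized. Everything in it is an illustrative assumption rather than the authors' released code: the `RoverTask` dataclass, the `generate_interleaved` and `answer_with_visuals` model methods, and the scoring are hypothetical stand-ins for the paper's human-annotated protocol.

```python
# A minimal sketch of ROVER's two settings as a task schema and scoring
# loop. All names here (RoverTask, generate_interleaved,
# answer_with_visuals) are hypothetical, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class RoverTask:
    """One of the benchmark's 1,312 human-annotated tasks."""
    setting: Literal["verbal_to_visual", "visual_to_verbal"]
    prompt: str                                 # verbal instruction or question
    image_paths: list[str] = field(default_factory=list)  # grounding images
    reference: str | None = None                # gold answer for QA-style tasks

def evaluate_task(model, task: RoverTask) -> dict:
    if task.setting == "verbal_to_visual":
        # Verbally-augmented reasoning for visual generation: the model
        # follows the prompt and its own reasoning chain to synthesize an
        # image, which is then judged for faithfulness (human-rated in the
        # paper; left abstract here).
        reasoning, image = model.generate_interleaved(task.prompt, task.image_paths)
        return {"reasoning": reasoning, "image": image}
    # Visually-augmented reasoning for verbal generation: the model may
    # emit intermediate visualizations that support its final answer.
    sketches, answer = model.answer_with_visuals(task.prompt, task.image_paths)
    return {"sketches": sketches, "answer": answer,
            "correct": task.reference is not None and answer == task.reference}
```

The split into interleaved generation versus visualization-assisted answering mirrors the abstract's finding that interleaved models outperform non-interleaved ones: the loop above makes that comparison explicit by routing each task through the modality the setting targets.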