InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
December 21, 2025
Authors: Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang
cs.AI
Abstract
The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still fall well short on the reasoning crucial for real-world tasks such as analyzing documents with dense charts and diagrams or navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image regions through multi-step reasoning. These problems are difficult even for frontier systems such as OpenAI o3, which obtains only 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher). For the latter, we introduce the task of generalized visual search: locating relational, fuzzy, or conceptual regions described in free-form language, going beyond simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (acting as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3.
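To make the two-agent framework concrete, below is a minimal Python sketch of the loop the abstract describes: a vReasoner that reasons step by step and delegates region lookups to a plug-and-play vSearcher. All names and interfaces here (`VSearcher.locate`, `Region`, `Step`, the step budget) are illustrative assumptions rather than the authors' actual API; see the repository above for the real implementation.

```python
# Hypothetical sketch of an InSight-o3-style two-agent loop.
# Every class and method name here is an assumption for illustration,
# not the interface from https://github.com/m-Just/InSight-o3.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates
    match: str                      # what the searcher found in that region


class VSearcher:
    """Generalized visual search: maps a free-form language query
    (relational, fuzzy, or conceptual) to an image region.
    In the paper this is a multimodal LLM trained with RL; stubbed here."""

    def locate(self, image, query: str) -> Optional[Region]:
        return Region(box=(0, 0, 64, 64), match=f"stub match for: {query}")


@dataclass
class Step:
    search_query: Optional[str] = None  # ask the vSearcher for a region...
    final_answer: Optional[str] = None  # ...or commit to an answer


class VReasoner:
    """Wrapper around a frontier multimodal model that reasons over the
    question and delegates fine-grained visual search to a vSearcher."""

    def __init__(self, searcher: VSearcher, max_steps: int = 8):
        self.searcher = searcher
        self.max_steps = max_steps

    def _propose(self, image, question: str, evidence: List) -> Step:
        # Placeholder policy: a real vReasoner would call a multimodal LLM
        # with the question plus all region evidence gathered so far.
        if evidence:
            return Step(final_answer=f"answer derived from {len(evidence)} region(s)")
        return Step(search_query=f"region relevant to: {question}")

    def answer(self, image, question: str) -> str:
        evidence: List[Tuple[str, Region]] = []
        for _ in range(self.max_steps):
            step = self._propose(image, question, evidence)
            if step.final_answer is not None:
                return step.final_answer
            region = self.searcher.locate(image, step.search_query)
            if region is not None:
                evidence.append((step.search_query, region))
        return "no answer within step budget"


if __name__ == "__main__":
    agent = VReasoner(VSearcher())
    print(agent.answer(image=None, question="What value does the third bar show?"))
```

The point of the plug-and-play design suggested by the abstract is that the same `VSearcher` can sit behind any frontier model acting as the vReasoner, so search capability improves without retraining the reasoning model.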