
InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

December 21, 2025
作者: Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang
cs.AI

Abstract

The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which obtains only 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher), for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3.
December 30, 2025