SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
December 30, 2025
Authors: Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu
cs.AI
Abstract
While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency needed to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand the coordinated use of external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates image search, text search, and image cropping tools to tackle fine-grained, knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve training stability and strengthen the model's ability to invoke tools and reason effectively. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images paired with knowledge-intensive, search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
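To make the interleaved reasoning-and-tool-use loop concrete, here is a minimal sketch of the kind of rollout the abstract describes. Everything here is an assumption for illustration: the `vlm.generate` interface, the `<tool>...</tool>` tag format, and the `tools` registry are hypothetical stand-ins, not the paper's actual protocol.

```python
import re

# Hypothetical tool-call syntax; the paper's actual format may differ.
TOOL_PATTERN = re.compile(r"<tool>(\w+)\((.*)\)</tool>", re.DOTALL)

def agentic_rollout(vlm, tools, image, question, max_turns=8):
    """Interleave free-form reasoning with tool calls: after each
    generation step, execute any requested tool and feed its result
    back into the context before generating again."""
    context = [("image", image), ("text", question)]
    output = ""
    for _ in range(max_turns):
        output = vlm.generate(context)      # reasoning, plus an optional tool call
        context.append(("text", output))
        call = TOOL_PATTERN.search(output)
        if call is None:                    # no tool call -> treat as final answer
            break
        name, arg = call.group(1), call.group(2)
        # tools is a dict such as
        # {"image_search": ..., "text_search": ..., "image_crop": ...}
        observation = tools[name](arg)
        context.append(("observation", observation))
    return output
```

The key property this loop illustrates is that tool results become part of the model's context mid-trajectory, so reasoning can continue conditioned on retrieved pages or cropped image regions rather than ending at a single tool call.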
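The abstract names BN-GSPO but does not spell out the objective, so the following is a hedged sketch under two assumptions: that BN-GSPO keeps GSPO's sequence-level clipped surrogate, and that the "batch-normalized" part replaces per-group advantage scaling with statistics computed over the whole batch. The function names and the epsilon value are illustrative, not the paper's.

```python
import torch

def bn_gspo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size), one row per prompt, one column
    per sampled rollout. Returns flattened advantages of shape (batch,)."""
    # Group-relative centering, as in GRPO/GSPO: subtract each group's mean reward.
    centered = rewards - rewards.mean(dim=1, keepdim=True)
    # Assumed "batch normalization": scale by the batch-wide standard deviation
    # rather than each group's own, so a near-zero-variance group cannot
    # produce exploding advantages.
    return (centered / (centered.std() + eps)).flatten()

def gspo_style_loss(logp_new, logp_old, advantages, lengths, clip_eps=0.2):
    """Sequence-level clipped surrogate in the style of GSPO.
    logp_new, logp_old: summed log-probabilities of each sequence, shape (batch,);
    lengths: token counts per sequence, shape (batch,)."""
    # Length-normalized, sequence-level importance ratio.
    ratio = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Maximize the clipped objective, i.e. minimize its negation.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

If this reading is right, the stabilizing effect would come from the shared denominator: groups where all rollouts earn nearly identical rewards no longer divide by a tiny per-group standard deviation.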