SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
December 30, 2025
Authors: Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu
cs.AI
Abstract
While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency needed to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated use of external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates image search, text search, and image cropping tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve training stability and strengthen the model's ability to invoke tools and reason effectively. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images paired with knowledge-intensive, search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
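For concreteness, below is a minimal PyTorch sketch of what a BN-GSPO-style update could look like. The abstract does not define the algorithm, so the specifics here are assumptions: we start from GSPO's length-normalized, sequence-level importance ratio with group-relative advantages, and replace the usual per-group standard-deviation scaling with batch-level normalization (our reading of the "batch-normalized" prefix). Function names such as `bn_gspo_advantages` are hypothetical and not taken from the paper.

```python
import torch

def bn_gspo_advantages(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
    """Group-relative advantages with batch-level normalization (assumed).

    `rewards` holds one scalar reward per rollout, shape
    (num_groups * group_size,), where each consecutive block of
    `group_size` rollouts was sampled from the same prompt.
    """
    r = rewards.view(-1, group_size)
    # Baseline: subtract each prompt's group-mean reward (as in GRPO/GSPO).
    adv = r - r.mean(dim=1, keepdim=True)
    # Normalize by the std over the whole batch rather than per group,
    # which avoids exploding advantages when one group's rewards are
    # nearly identical (per-group std close to zero).
    return (adv / (adv.std() + 1e-6)).view(-1)

def bn_gspo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                 seq_lens: torch.Tensor, adv: torch.Tensor,
                 clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped sequence-level policy objective in the style of GSPO.

    `logp_new` / `logp_old` are summed token log-probs per rollout under
    the current and behavior policies; `seq_lens` are response token counts.
    """
    # Length-normalized sequence importance ratio: exp(mean token log-ratio).
    ratio = torch.exp((logp_new - logp_old) / seq_lens)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO-style pessimistic objective, maximized, so negated to form a loss.
    return -torch.min(ratio * adv, clipped * adv).mean()
```

In this reading, the sequence-level ratio and clipping follow published GSPO, while the batch-level advantage statistics would be the stability mechanism the abstract attributes to BN-GSPO; how the released code actually implements it may differ.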