MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
December 29, 2025
Authors: Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Hongyuan Zhang, Pengfei Yu, Xudong Rao, Ning Mao, Xiaobo Liu, Lian Wen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Shanshan Li, Zide Liu, Jing Luo, Lifu Mu, Xuhao Pan, Chang Ren, Haoyi Sun, Qian Wang, Wei Wang, Hongfu Yang, Jiqing Zhan, Chunpeng Zhou, Zheng Zhou, Hao Ma, Tao Wei, Pan Zhou, Wei Chen
cs.AI
Abstract
Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems that require tool invocation. Tool-integrated reasoning (TIR) agents, which reason and invoke tools autonomously, are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent that integrates interleaved thinking with multimodal chain-of-thought (CoT) reasoning. MindWatcher autonomously decides whether and how to invoke diverse tools and coordinates their use, without relying on human prompts or predefined workflows. The interleaved thinking paradigm lets the model switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability lets it manipulate images during reasoning to obtain more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality training datasets, and we construct a benchmark, MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher that improves training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
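The interleaved thinking pattern described in the abstract can be illustrated with a toy control loop. The sketch below is not MindWatcher's implementation: the step types, the `run_agent` function, the scripted policy, and the two stand-in tools (`crop`, `image_search`) are all hypothetical names introduced for illustration, and a real agent would replace the scripted policy with a trained model. It only shows the idea that each step is chosen freely as a thought, a tool call, or a final answer, and that an image operation can precede a search.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Think:
    text: str


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class Answer:
    text: str


Step = Union[Think, ToolCall, Answer]
History = List[Tuple[str, str]]  # (kind, text) pairs


def run_agent(policy: Callable[[History], Step],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 8) -> History:
    """Let the policy pick a thought, a tool call, or an answer at every step."""
    history: History = [("question", question)]
    for _ in range(max_steps):
        step = policy(history)
        if isinstance(step, Think):
            history.append(("think", step.text))
        elif isinstance(step, ToolCall):
            result = tools[step.name](step.argument)
            history.append(("tool_call", f"{step.name}({step.argument})"))
            history.append(("tool_result", result))
        else:
            history.append(("answer", step.text))
            break
    return history


# Hypothetical stand-in tools: a crop that tags the image name, and a
# lookup table playing the role of a local image retrieval database.
TOY_INDEX = {"photo.jpg#crop": "red panda"}
TOOLS = {
    "crop": lambda image: image + "#crop",
    "image_search": lambda image: TOY_INDEX.get(image, "unknown"),
}


def scripted_policy(history: History) -> Step:
    """Hard-coded stand-in for a model: think, crop, think, search, answer."""
    last_kind, _ = history[-1]
    results = [text for kind, text in history if kind == "tool_result"]
    if last_kind in ("question", "tool_result") and len(results) < 2:
        return Think("Decide which tool, if any, to use next.")
    if last_kind == "think" and len(results) == 0:
        return ToolCall("crop", "photo.jpg")
    if last_kind == "think" and len(results) == 1:
        return ToolCall("image_search", results[0])
    return Answer(results[-1])


trace = run_agent(scripted_policy, TOOLS, "What animal is in photo.jpg?")
for kind, text in trace:
    print(f"{kind}: {text}")
```

In this toy run the trace alternates between thoughts and tool calls before answering, and the search is issued on the cropped image rather than the original, which mirrors the abstract's point about manipulating images during reasoning to sharpen retrieval.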