MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
December 29, 2025
Authors: Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Hongyuan Zhang, Pengfei Yu, Xudong Rao, Ning Mao, Xiaobo Liu, Lian Wen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Shanshan Li, Zide Liu, Jing Luo, Lifu Mu, Xuhao Pan, Chang Ren, Haoyi Sun, Qian Wang, Wei Wang, Hongfu Yang, Jiqing Zhan, Chunpeng Zhou, Zheng Zhou, Hao Ma, Tao Wei, Pan Zhou, Wei Chen
cs.AI
Abstract
Traditional workflow-based agents show limited intelligence when addressing real-world problems that require tool invocation. Tool-integrated reasoning (TIR) agents, which reason autonomously and invoke tools on their own, are rapidly emerging as a powerful approach to complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent that integrates interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher autonomously decides whether and how to invoke diverse tools and coordinates their use, without relying on human prompts or predefined workflows. The interleaved thinking paradigm lets the model switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows it to manipulate images during reasoning to obtain more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality training datasets, and we construct a benchmark, MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher that improves training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
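As a rough illustration of the interleaved thinking paradigm described above, the following minimal Python sketch shows a loop in which a policy may emit a thought, a tool call, or a final answer at any step. It is not MindWatcher's implementation: the step format, the function `run_interleaved`, the `scripted_policy` stand-in for the model, and the toy `crop` and `search` tools are all hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str]      # (kind, text) entries recorded in the trace
Step = Tuple[str, object]    # hypothetical step format emitted by the policy


def run_interleaved(policy: Callable[[List[Event]], Step],
                    tools: Dict[str, Callable[[str], str]],
                    max_steps: int = 8) -> Tuple[str, List[Event]]:
    """Alternate freely between thinking and tool calls until an answer."""
    events: List[Event] = []
    for _ in range(max_steps):
        kind, payload = policy(events)
        if kind == "think":
            events.append(("think", str(payload)))
        elif kind == "tool":
            name, arg = payload
            # Unknown tools yield an error observation instead of crashing.
            if name in tools:
                observation = tools[name](arg)
            else:
                observation = f"unknown tool: {name}"
            events.append(("tool", f"{name}({arg})"))
            events.append(("observation", observation))
        elif kind == "answer":
            events.append(("answer", str(payload)))
            return str(payload), events
        else:
            raise ValueError(f"unexpected step kind: {kind}")
    return "", events  # step budget exhausted without an answer


# Toy tools (assumptions): a pretend image crop and a pretend search.
TOOLS: Dict[str, Callable[[str], str]] = {
    "crop": lambda image: f"cropped:{image}",
    "search": lambda query: f"results for {query}",
}


def scripted_policy(events: List[Event]) -> Step:
    """Stand-in for the model: replays a fixed think/tool/answer script."""
    script: List[Step] = [
        ("think", "The object is small; crop before searching."),
        ("tool", ("crop", "image.png")),
        ("think", "Now search with the cropped region."),
        ("tool", ("search", "cropped:image.png")),
        ("answer", "done"),
    ]
    # Observations are produced by tools, not by the policy, so skip them.
    emitted = sum(1 for kind, _ in events if kind != "observation")
    return script[emitted]
```

Calling `run_interleaved(scripted_policy, TOOLS)` replays the script and returns the final answer together with the full trace of thoughts, tool calls, and observations. In a real agent the scripted policy would be replaced by the model's own decoding step.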