MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

Abstract

Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems that require tool invocation. Tool-integrated reasoning (TIR) agents, capable of autonomous reasoning and tool invocation, are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent that integrates interleaved thinking with multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or predefined workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows it to manipulate images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
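
The interleaved thinking paradigm described above can be pictured as a loop in which each step is a thought, a tool call, or a final answer, chosen freely at every turn. The following is only a minimal sketch under stated assumptions: `Step`, `run_agent`, `scripted_policy`, and the `crop` and `image_search` tools are hypothetical names invented for illustration, and the scripted policy stands in for a real model. None of this is taken from MindWatcher's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical illustration only: these names are NOT MindWatcher's API.


@dataclass
class Step:
    kind: str                      # "think", "tool", or "answer"
    content: str = ""              # thought text or final answer
    tool: str = ""                 # tool name when kind == "tool"
    args: dict = field(default_factory=dict)


def run_agent(policy, tools, question, max_steps=8):
    """Let the policy alternate between thinking and tool calls at any step."""
    history = [("question", question)]
    for _ in range(max_steps):
        step = policy(history)
        if step.kind == "think":
            history.append(("think", step.content))
        elif step.kind == "tool":
            # Run the requested tool and feed its result back as an observation.
            observation = tools[step.tool](**step.args)
            history.append(("observation", observation))
        elif step.kind == "answer":
            history.append(("answer", step.content))
            return step.content, history
        else:
            raise ValueError(f"unknown step kind: {step.kind}")
    return "", history  # step budget exhausted without an answer


# Toy tools: string stand-ins for image cropping and image retrieval.
TOOLS = {
    "crop": lambda image, box: f"{image}[{','.join(map(str, box))}]",
    "image_search": lambda image: f"top match for {image}: red fox",
}


def scripted_policy(history):
    """Stand-in for a model: replays a fixed think/tool/think/tool/answer script."""
    script = [
        Step("think", "The object is small; crop before searching."),
        Step("tool", tool="crop",
             args={"image": "photo.jpg", "box": (10, 10, 90, 90)}),
        Step("think", "Now search with the cropped region."),
        Step("tool", tool="image_search",
             args={"image": "photo.jpg[10,10,90,90]"}),
        Step("answer", "red fox"),
    ]
    # Each step appends exactly one history entry after the question.
    return script[len(history) - 1]


answer, trace = run_agent(scripted_policy, TOOLS, "What animal is in photo.jpg?")
```

In this toy run the trace alternates thought, observation, thought, observation, answer, which mirrors the idea of manipulating an image (here, a crop) in the middle of reasoning before issuing a search.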
