MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences

Abstract

Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.
