MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
March 29, 2026
Authors: Shijian Wang, Jiarui Jin, Runhao Fu, Zexuan Yan, Xingjian Wang, Mengkang Hu, Eric Wang, Xiaoxi Li, Kangning Zhang, Li Yao, Wenxiang Jiao, Xuelian Cheng, Yuan Lu, Zongyuan Ge
cs.AI
Abstract
Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.
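The pipeline described above (atomic experiences distilled by hindsight reasoning, a quality-filtered experience bank, and complementary wide/deep retrieval) can be illustrated with a minimal sketch. All class names, fields, and the tag-based retrieval heuristic below are hypothetical simplifications, not the paper's actual implementation: the paper's experiences are produced by an LLM critic and retrieved by a learned policy, whereas this sketch stands in quality scores and semantic-viewpoint tags for both.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    # One atomic decision experience distilled from an interaction state
    # (field names are illustrative, not from the paper).
    state_summary: str
    decision: str
    lesson: str
    quality: float                           # hindsight critic's score (assumed)
    tags: set = field(default_factory=set)   # compositional semantic viewpoints


class ExperienceBank:
    """Quality-filtered bank with wide- and deep-search retrieval (sketch)."""

    def __init__(self, quality_threshold: float = 0.5):
        self.quality_threshold = quality_threshold
        self.bank: list[Experience] = []

    def add(self, exp: Experience) -> None:
        # Quality filtering: only experiences above the threshold enter the bank.
        if exp.quality >= self.quality_threshold:
            self.bank.append(exp)

    def wide_search(self, query_tags: set, k: int = 3) -> list[Experience]:
        # Breadth: cover many viewpoints by taking the single best
        # experience per matching tag, then deduplicating.
        best = {}
        for exp in self.bank:
            for tag in exp.tags & set(query_tags):
                if tag not in best or exp.quality > best[tag].quality:
                    best[tag] = exp
        unique = []
        for exp in best.values():
            if exp not in unique:
                unique.append(exp)
        return unique[:k]

    def deep_search(self, query_tags: set, k: int = 3) -> list[Experience]:
        # Depth: rank matching experiences by viewpoint overlap with the
        # query, breaking ties by quality, and return the top k.
        matches = [e for e in self.bank if e.tags & set(query_tags)]
        matches.sort(
            key=lambda e: (len(e.tags & set(query_tags)), e.quality),
            reverse=True,
        )
        return matches[:k]
```

At inference time, a wide search would supply diverse guidance across viewpoints while a deep search would concentrate on the experiences most relevant to the current state; the two are complementary rather than alternatives.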