

Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

November 26, 2025
Authors: Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li
cs.AI

Abstract

Multimodal large language models (MLLMs) exhibit strong reasoning on isolated queries, yet they operate de novo: each problem is solved independently, and the same mistakes are often repeated. Existing memory-augmented agents mainly store past trajectories for reuse, but trajectory-based memory suffers from brevity bias and gradually loses essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We therefore introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from both their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge, preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
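Since the abstract only names the mechanism, a minimal sketch may help make the dual-stream grow-and-refine idea concrete. The sketch below is an assumption-laden illustration, not the authors' implementation: the class and method names (`DualStreamMemory`, `grow_or_refine`, `retrieve`), the schema fields, and the use of `difflib` string similarity in place of learned embeddings are all hypothetical.

```python
"""Minimal sketch of a dual-stream grow-and-refine semantic memory.

Illustrative only: reconstructed from the abstract's description of
separate visual-error and logical-error streams. Real systems would
likely match schemas with learned embeddings, not string similarity.
"""
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Schema:
    pattern: str     # compact description of the observed error pattern
    guideline: str   # corrective strategy distilled from the experience
    hits: int = 1    # how often this schema has been reinforced


class DualStreamMemory:
    def __init__(self, merge_threshold: float = 0.8):
        # Separate stores: visual distraction patterns vs. logical
        # reasoning errors, per the abstract's dual-stream design.
        self.streams: dict[str, list[Schema]] = {"visual": [], "logical": []}
        self.merge_threshold = merge_threshold

    def grow_or_refine(self, stream: str, pattern: str, guideline: str) -> None:
        """Add a new schema (grow) or update the closest existing one (refine)."""
        for entry in self.streams[stream]:
            sim = SequenceMatcher(None, entry.pattern, pattern).ratio()
            if sim >= self.merge_threshold:
                entry.guideline = guideline  # refine: keep the latest strategy
                entry.hits += 1
                return
        self.streams[stream].append(Schema(pattern, guideline))  # grow: novel

    def retrieve(self, stream: str, query: str, k: int = 3) -> list[Schema]:
        """Return the k stored schemas most similar to the current problem."""
        return sorted(
            self.streams[stream],
            key=lambda e: SequenceMatcher(None, e.pattern, query).ratio(),
            reverse=True,
        )[:k]


# Usage: after a failed attempt, log the diagnosed error into the right
# stream; before the next attempt, retrieve guidelines as prompt hints.
memory = DualStreamMemory()
memory.grow_or_refine(
    "visual",
    "attended to decorative gridlines instead of the plotted curve",
    "crop to the axes region and re-read data points before reasoning",
)
for hint in memory.retrieve("visual", "misread a cluttered line chart"):
    print(hint.guideline)
```

The point of the separation is that a retrieved visual guideline can steer attention before perception, while a logical guideline can be injected into the reasoning prompt; merging near-duplicate schemas instead of appending keeps the memory compact as experience accumulates.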