MMA: Multimodal Memory Agent
February 18, 2026
Authors: Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang
cs.AI
Abstract
Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
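The core mechanism described above — scoring each retrieved memory item by source credibility, temporal decay, and consensus, then reweighting evidence and abstaining under weak support — can be sketched as follows. This is a minimal illustration, not MMA's actual formulation: the linear weights, the exponential-decay form, the item fields (`cred`, `age`, `cons`, `claim`), and the abstention threshold `tau` are all assumptions for exposition.

```python
import math

def reliability_score(credibility, age_days, consensus,
                      half_life=30.0, weights=(0.4, 0.3, 0.3)):
    """Combine source credibility, temporal decay, and conflict-aware
    consensus into a single reliability score in [0, 1].
    The weighting scheme and half-life are illustrative assumptions."""
    # Exponential decay: the freshness term halves every `half_life` days.
    decay = math.exp(-math.log(2) * age_days / half_life)
    w_c, w_t, w_n = weights
    return w_c * credibility + w_t * decay + w_n * consensus

def answer_or_abstain(items, tau=0.5):
    """Reweight retrieved memory items by reliability; abstain (return None)
    when total reliable support falls below the assumed threshold `tau`."""
    scored = [(reliability_score(i["cred"], i["age"], i["cons"]), i)
              for i in items]
    if sum(score for score, _ in scored) < tau:
        return None  # abstain: evidence is stale, weak, or conflicting
    # Otherwise, answer from the most reliable item.
    return max(scored, key=lambda pair: pair[0])[1]["claim"]
```

A fresh, credible, widely corroborated item yields an answer, while a stale low-credibility item triggers abstention — the behavior the abstract attributes to MMA's reliability signal.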