MMA: Multimodal Memory Agent
February 18, 2026
Authors: Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang
cs.AI
Abstract
Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
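The core mechanism described above — scoring each retrieved memory item by source credibility, temporal decay, and consensus, then reweighting evidence and abstaining under weak support — can be sketched as follows. This is a minimal illustration, not MMA's actual formulation: the linear weights, the exponential-decay form, the item fields (`cred`, `age`, `cons`, `claim`), and the abstention threshold `tau` are all assumptions for exposition.

```python
import math

def reliability_score(credibility, age_days, consensus,
                      half_life=30.0, weights=(0.4, 0.3, 0.3)):
    """Combine source credibility, temporal decay, and conflict-aware
    consensus into a single reliability score in [0, 1].
    The weighting scheme and half-life are illustrative assumptions."""
    # Exponential decay: the freshness term halves every `half_life` days.
    decay = math.exp(-math.log(2) * age_days / half_life)
    w_c, w_t, w_n = weights
    return w_c * credibility + w_t * decay + w_n * consensus

def answer_or_abstain(items, tau=0.5):
    """Reweight retrieved memory items by reliability; abstain (return None)
    when total reliable support falls below the assumed threshold `tau`."""
    scored = [(reliability_score(i["cred"], i["age"], i["cons"]), i)
              for i in items]
    if sum(score for score, _ in scored) < tau:
        return None  # abstain: evidence is stale, weak, or conflicting
    # Otherwise, answer from the most reliable item.
    return max(scored, key=lambda pair: pair[0])[1]["claim"]
```

A fresh, credible, widely corroborated item yields an answer, while a stale low-credibility item triggers abstention — the behavior the abstract attributes to MMA's reliability signal.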