MMA: Multimodal Memory Agent
February 18, 2026
Authors: Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang
cs.AI
Abstract
Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
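The abstract describes reliability scoring as a combination of source credibility, temporal decay, and conflict-aware consensus, with abstention when support is insufficient. The paper's exact formulation is not given here, so the following is only a minimal sketch under assumed choices: a multiplicative combination, exponential decay with a configurable half-life, Laplace-smoothed agreement as a stand-in for network consensus, and a fixed abstention threshold. All function and field names (`reliability_score`, `answer_or_abstain`, `agree`, `disagree`, etc.) are hypothetical, not from the MMA codebase.

```python
def reliability_score(credibility, age_days, agree, disagree,
                      half_life_days=30.0):
    """Combine source credibility, temporal decay, and a simple
    conflict-aware consensus into one score in [0, 1].
    Assumed (illustrative) formulation, not the paper's."""
    # Temporal decay: exponential, halving every `half_life_days`.
    decay = 0.5 ** (age_days / half_life_days)
    # Consensus: fraction of supporting neighbors, Laplace-smoothed
    # so an item with no neighbors gets a neutral 0.5.
    consensus = (agree + 1) / (agree + disagree + 2)
    return credibility * decay * consensus

def answer_or_abstain(items, threshold=0.4):
    """Reweight retrieved memory items by reliability; abstain
    (return None) when no claim is sufficiently supported."""
    weights = {}
    for item in items:
        w = reliability_score(item["credibility"], item["age_days"],
                              item["agree"], item["disagree"])
        weights[item["claim"]] = weights.get(item["claim"], 0.0) + w
    best, support = max(weights.items(), key=lambda kv: kv[1])
    return best if support >= threshold else None
```

Under this sketch, a fresh, well-corroborated item dominates a stale, contradicted one, and a single weak item falls below the threshold and triggers abstention rather than an overconfident answer.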