MMA: Multimodal Memory Agent
February 18, 2026
Authors: Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang
cs.AI
Abstract
Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
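The abstract describes reliability scoring as a combination of source credibility, temporal decay, and conflict-aware consensus, with abstention when support is insufficient. The paper's exact formulation is not given here, so the following is only a minimal sketch under assumed choices: a multiplicative combination, exponential decay with a configurable half-life, Laplace-smoothed agreement as a stand-in for network consensus, and a fixed abstention threshold. All function and field names (`reliability_score`, `answer_or_abstain`, `agree`, `disagree`, etc.) are hypothetical, not from the MMA codebase.

```python
def reliability_score(credibility, age_days, agree, disagree,
                      half_life_days=30.0):
    """Combine source credibility, temporal decay, and a simple
    conflict-aware consensus into one score in [0, 1].
    Assumed (illustrative) formulation, not the paper's."""
    # Temporal decay: exponential, halving every `half_life_days`.
    decay = 0.5 ** (age_days / half_life_days)
    # Consensus: fraction of supporting neighbors, Laplace-smoothed
    # so an item with no neighbors gets a neutral 0.5.
    consensus = (agree + 1) / (agree + disagree + 2)
    return credibility * decay * consensus

def answer_or_abstain(items, threshold=0.4):
    """Reweight retrieved memory items by reliability; abstain
    (return None) when no claim is sufficiently supported."""
    weights = {}
    for item in items:
        w = reliability_score(item["credibility"], item["age_days"],
                              item["agree"], item["disagree"])
        weights[item["claim"]] = weights.get(item["claim"], 0.0) + w
    best, support = max(weights.items(), key=lambda kv: kv[1])
    return best if support >= threshold else None
```

Under this sketch, a fresh, well-corroborated item dominates a stale, contradicted one, and a single weak item falls below the threshold and triggers abstention rather than an overconfident answer.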