MMA: Multimodal Memory Agent

Abstract

Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
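
The abstract names the three ingredients of the reliability score (source credibility, temporal decay, and conflict-aware consensus) but does not give the formula. The following is a minimal illustrative sketch, not the authors' implementation: the multiplicative combination, the exponential half-life decay, the `MemoryItem` fields, and the abstention threshold are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryItem:
    """Hypothetical memory record; field names are illustrative assumptions."""
    text: str
    source_credibility: float  # assumed range [0, 1]
    age_days: float            # time since the item was stored
    consensus: float           # assumed range [-1, 1]; negative = neighbors contradict


def reliability(item: MemoryItem, half_life_days: float = 30.0) -> float:
    """Combine the three signals into one score in [0, 1] (assumed product form)."""
    decay = 0.5 ** (item.age_days / half_life_days)  # exponential temporal decay
    agreement = (item.consensus + 1.0) / 2.0         # map [-1, 1] onto [0, 1]
    return item.source_credibility * decay * agreement


def answer_or_abstain(items: List[MemoryItem], threshold: float = 0.3) -> Optional[str]:
    """Return the most reliable item's text, or None to abstain."""
    if not items:
        return None
    best = max(items, key=reliability)
    return best.text if reliability(best) >= threshold else None


fresh = MemoryItem("The meeting is on Friday.", 0.9, 0.0, 1.0)
stale = MemoryItem("The meeting is on Monday.", 0.9, 30.0, 0.0)
print(answer_or_abstain([fresh, stale]))  # picks the fresh, well-supported item
print(answer_or_abstain([stale]))         # abstains: support is below the threshold
```

Under these assumptions, a credible item can still fall below the threshold once it has aged and lost consensus, which is the situation where the abstract says the agent should abstain rather than answer.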
