MMA: Multimodal Memory Agent

Abstract

Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility. On LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers. On MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
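The reliability scoring and abstention idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names the three signals (source credibility, temporal decay, network consensus) but not how they are combined, so the multiplicative form, the half-life decay, and every name and threshold below (`MemoryItem`, `reliability`, `answer_or_abstain`, `half_life_days`, `min_support`, `min_margin`) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryItem:
    """Hypothetical retrieved memory item (field names are assumptions)."""
    claim: str                 # the answer this item supports
    source_credibility: float  # in [0, 1], prior trust in the source
    age_days: float            # time since the item was stored
    consensus: float           # in [-1, 1], agreement with related items


def reliability(item: MemoryItem, half_life_days: float = 30.0) -> float:
    """Combine the three signals into one score in [0, 1].

    Assumed form: credibility * exponential decay * rescaled consensus.
    """
    decay = 0.5 ** (item.age_days / half_life_days)  # halves every half-life
    agreement = (item.consensus + 1.0) / 2.0         # map [-1, 1] to [0, 1]
    return item.source_credibility * decay * agreement


def answer_or_abstain(
    items: List[MemoryItem],
    min_support: float = 0.3,
    min_margin: float = 0.1,
) -> Optional[str]:
    """Return the best-supported claim, or None to abstain."""
    support: Dict[str, float] = {}
    for item in items:
        # Reweight evidence: each item votes with its reliability score.
        support[item.claim] = support.get(item.claim, 0.0) + reliability(item)
    if not support:
        return None  # nothing retrieved, so abstain
    ranked = sorted(support.values(), reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    # Abstain when support is weak or the top two claims are too close.
    if best < min_support or best - runner_up < min_margin:
        return None
    return max(support, key=lambda claim: support[claim])
```

Under these assumptions, a fresh item from a credible source outweighs a stale item from an equally credible source, and two equally reliable items that contradict each other lead the agent to abstain rather than pick one.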
