

Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

November 26, 2025
Authors: Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li
cs.AI

Abstract

Multimodal large language models (MLLMs) exhibit strong reasoning on isolated queries, yet they operate de novo, solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from both their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge, preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction-hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
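The abstract gives no implementation details, so the following is only a minimal Python sketch of how a dual-stream, grow-and-refine semantic memory along these lines might be organized. All names here (`DualStreamMemory`, `SchemaEntry`, `grow`, `retrieve`, the `similar` predicate) are illustrative assumptions, not the authors' API.

```python
from dataclasses import dataclass, field

@dataclass
class SchemaEntry:
    """A compact, schema-based memory item distilled from one solved problem."""
    pattern: str       # description of the observed error pattern or strategy
    guideline: str     # corrective guideline to apply on future queries
    hits: int = 1      # how often this entry was reinforced (used for retrieval)

@dataclass
class DualStreamMemory:
    """Two coordinated but distinct streams, mirroring the visual (distraction)
    vs. logical (reasoning/hallucination) separation described in the paper."""
    visual: list[SchemaEntry] = field(default_factory=list)
    logical: list[SchemaEntry] = field(default_factory=list)

    def grow(self, stream: str, entry: SchemaEntry, similar) -> None:
        """Grow-and-refine: merge into an existing schema when `similar` finds a
        match (refine), otherwise append a new entry (grow). Refining in place
        rather than overwriting wholesale is one way to keep stable strategies
        while avoiding unbounded growth."""
        bank = self.visual if stream == "visual" else self.logical
        for existing in bank:
            if similar(existing, entry):
                existing.guideline = entry.guideline  # refine the schema
                existing.hits += 1
                return
        bank.append(entry)  # grow the memory

    def retrieve(self, stream: str, k: int = 3) -> list[SchemaEntry]:
        """Return the k most-reinforced schemas, e.g. to prepend to the prompt."""
        bank = self.visual if stream == "visual" else self.logical
        return sorted(bank, key=lambda e: e.hits, reverse=True)[:k]

# Hypothetical usage: record a visual-distraction error after a failed attempt.
memory = DualStreamMemory()
memory.grow(
    "visual",
    SchemaEntry(pattern="misread axis labels on cluttered chart",
                guideline="locate and verify both axis labels before reading values"),
    similar=lambda a, b: a.pattern == b.pattern,
)
```

In an agentic loop, one would presumably distill such entries from failed and successful traces and inject the top retrieved schemas of each stream into the MLLM's context before the next query; the actual distillation and matching mechanisms are specific to ViLoMem and not shown here.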