Mem-α: Learning Memory Construction via Reinforcement Learning

September 30, 2025
Authors: Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, Xiaojian Wu
cs.AI

Abstract

Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Current memory-augmented agents typically depend on pre-defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem-alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi-turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question-answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem-alpha achieves significant improvements over existing memory-augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem-alpha.
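For intuition, below is a minimal Python sketch of the kind of setup the abstract describes: a memory with core, episodic, and semantic components that the agent updates through tool calls while processing sequential chunks, scored afterwards by downstream question-answering accuracy as the reward. All names here (`MemoryState`, the tool strings, `agent.decide`, `agent.answer`) are illustrative assumptions, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryState:
    # Hypothetical three-part layout mirroring the abstract's description.
    core: str = ""                                        # compact, always-available profile/summary
    episodic: list[str] = field(default_factory=list)     # event-level records from interactions
    semantic: list[str] = field(default_factory=list)     # distilled facts and general knowledge

    def apply(self, tool: str, content: str) -> None:
        # Minimal tool set for illustration; the paper's actual memory tools may differ.
        if tool == "core_update":
            self.core = content
        elif tool == "episodic_insert":
            self.episodic.append(content)
        elif tool == "semantic_insert":
            self.semantic.append(content)


def rollout_reward(agent, chunks, qa_pairs) -> float:
    """Process information chunks sequentially, then score the constructed
    memory by downstream QA accuracy (the RL reward signal)."""
    memory = MemoryState()
    for chunk in chunks:                                   # sequential information chunks
        # The agent chooses which memory operations to issue for this chunk
        # (assumed to return (tool, content) pairs).
        for tool, content in agent.decide(chunk, memory):
            memory.apply(tool, content)
    # Reward: accuracy of answers over the full interaction history,
    # answered using only the constructed memory.
    correct = sum(agent.answer(question, memory) == answer
                  for question, answer in qa_pairs)
    return correct / len(qa_pairs)
```

In this sketch the reward depends only on what ends up in `MemoryState`, so optimizing it with reinforcement learning directly pressures the agent toward storing and structuring the information needed for later questions.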