MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
January 17, 2026
Authors: Zecheng Tang, Baibei Ji, Ruoxi Sun, Haitian Wang, WangJie You, Zhang Yijun, Wenpeng Zhu, Ji Qi, Juntao Li, Min Zhang
cs.AI
Abstract
Existing works increasingly adopt memory-centric mechanisms to process long contexts segment by segment, and effective memory management is one of the key capabilities that enable large language models (LLMs) to propagate information across the entire sequence. Leveraging reward models (RMs) to evaluate memory quality automatically and reliably is therefore critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations of 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
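
For intuition only, the sketch below shows one common way a reward model can be used as a judge over memory-management steps: given a new context segment, the prior memory state, and two candidate memory updates, the RM scores each update and is counted correct when it prefers the gold update. This is a minimal illustration under assumed names (MemoryCase, pairwise_accuracy, toy_score are all hypothetical), not the paper's released evaluation code or data format.

```python
# Illustrative sketch, not MemoryRewardBench's actual implementation.
# All class/function names here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MemoryCase:
    """One pairwise instance: a context segment plus two candidate memory updates."""
    segment: str          # the newly arrived chunk of the long context
    prior_memory: str     # memory state carried over from earlier segments
    chosen_update: str    # the gold-preferred memory update
    rejected_update: str  # the worse memory update


def pairwise_accuracy(cases: List[MemoryCase],
                      score: Callable[[str, str, str], float]) -> float:
    """Fraction of cases where the RM scores the chosen update above the rejected one.

    `score(prior_memory, segment, update)` can be any scalar reward function,
    e.g. a prompted LLM judge or a trained reward-model head.
    """
    correct = 0
    for case in cases:
        s_chosen = score(case.prior_memory, case.segment, case.chosen_update)
        s_rejected = score(case.prior_memory, case.segment, case.rejected_update)
        correct += int(s_chosen > s_rejected)
    return correct / len(cases)


if __name__ == "__main__":
    # Toy stand-in scorer: rewards updates that retain more of the new segment.
    def toy_score(prior: str, segment: str, update: str) -> float:
        seg_tokens = set(segment.lower().split())
        upd_tokens = set(update.lower().split())
        return len(seg_tokens & upd_tokens) / max(len(seg_tokens), 1)

    demo = [MemoryCase(
        segment="The meeting was moved to Friday at 3 pm in room 204.",
        prior_memory="Project kickoff scheduled for Monday.",
        chosen_update="Kickoff Monday; follow-up meeting Friday 3 pm, room 204.",
        rejected_update="Project kickoff scheduled for Monday.",
    )]
    print(f"pairwise accuracy: {pairwise_accuracy(demo, toy_score):.2f}")
```

In practice the scorer would be replaced by an actual reward model, and the benchmark's settings would vary the memory-management pattern and stretch the underlying contexts from 8K to 128K tokens.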