MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
January 17, 2026
Authors: Zecheng Tang, Baibei Ji, Ruoxi Sun, Haitian Wang, WangJie You, Zhang Yijun, Wenpeng Zhu, Ji Qi, Juntao Li, Min Zhang
cs.AI
Abstract
Existing works increasingly adopt memory-centric mechanisms to process long contexts in a segmented manner, and effective memory management is one of the key capabilities that enables large language models (LLMs) to propagate information across the entire sequence. Therefore, leveraging reward models (RMs) to automatically and reliably evaluate memory quality is critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations of 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the strengths and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
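To make the evaluation setup concrete, the sketch below illustrates one common way such RM benchmarks are scored: pairwise accuracy, i.e., how often a reward model assigns a higher score to the better memory-management trace than to the worse one. This is a minimal illustration, not the paper's released code; the function names (`score_with_rm`, `pairwise_accuracy`) and the pair format are assumptions for the example.

```python
# Minimal sketch of pairwise-accuracy scoring for a reward model (hypothetical,
# not MemoryRewardBench's actual evaluation code).
from typing import Callable, List, Tuple

def pairwise_accuracy(
    pairs: List[Tuple[str, str, str]],           # (context, better_memory, worse_memory)
    score_with_rm: Callable[[str, str], float],  # hypothetical RM scoring function
) -> float:
    """Fraction of pairs where the RM scores the better memory trace higher."""
    if not pairs:
        return 0.0
    correct = 0
    for context, better_memory, worse_memory in pairs:
        if score_with_rm(context, better_memory) > score_with_rm(context, worse_memory):
            correct += 1
    return correct / len(pairs)
```

In such a setup, a score near 0.5 would mean the RM cannot distinguish good from poor memory management, while higher values indicate more reliable judgments; per-setting breakdowns (e.g., by task type or context length) would then expose where a given RM succeeds or fails.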