ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
April 9, 2026
Authors: Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong
cs.AI
Abstract
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory, in which experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus--Unconditioned Stimulus (CS--US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (17.6% on inhibition tasks vs. 75.0% on preference tasks) and universal bottlenecks, indicating that architectural innovation beyond parameter scaling is needed. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".
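The unified Learning/Priming-Interfere-Test protocol with first-attempt scoring can be sketched as follows. This is a minimal illustration of the evaluation flow described above, not the released harness: the `Item` fields, the `model` callable, and the substring check standing in for the real behavioral scorer are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """One benchmark item; field names here are illustrative, not the released schema."""
    learning_prompt: str      # experience to be internalized (Learning/Priming phase)
    interference_prompt: str  # distractor turn inserted before the test (Interfere phase)
    test_prompt: str          # probe whose FIRST response is scored (Test phase)
    expected_behavior: str    # marker of the automatic behavior counted as success

def run_item(model, item: Item) -> bool:
    """Run Learning/Priming -> Interfere -> Test for one item with first-attempt scoring.

    `model` is any chat function mapping a message history to a reply string.
    """
    history = []
    for prompt in (item.learning_prompt, item.interference_prompt):
        history.append({"role": "user", "content": prompt})
        history.append({"role": "assistant", "content": model(history)})
    # Test phase: only the first attempt counts; no retries, no explicit reminder
    # of the learned experience is given.
    history.append({"role": "user", "content": item.test_prompt})
    first_response = model(history)
    # Placeholder scorer: a real harness would judge the enacted behavior,
    # not just search for a substring.
    return item.expected_behavior in first_response
```

A model is credited only if the learned behavior surfaces spontaneously on the first test response, which is what distinguishes implicit enactment from prompted explicit recall.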