ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
April 9, 2026
Authors: Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong
cs.AI
Abstract
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory, where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid previously failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias measured via paired experimental/control instances), and Classical Conditioning (conditioned stimulus–unconditioned stimulus (CS–US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with the top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) falling far below human baselines. Our analysis uncovers dramatic asymmetries (17.6% on inhibition tasks vs. 75.0% on preference tasks) and bottlenecks shared across all models, indicating that architectural innovation beyond parameter scaling is needed. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".
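The Learning/Priming-Interfere-Test protocol with first-attempt scoring can be sketched as a simple evaluation loop. This is a minimal illustration, not the released harness: the `Episode` structure, the `toy` agent interface (a callable over chat history), and the judging callback are all hypothetical names introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, message) pairs

@dataclass
class Episode:
    learning_turns: List[str]      # experience to be implicitly encoded
    interference_turns: List[str]  # unrelated filler between learning and test
    test_prompt: str               # probe with NO explicit reminder of the learning phase
    is_correct: Callable[[str], bool]  # judges only the first response

def run_episode(agent: Callable[[History], str], ep: Episode) -> bool:
    """Play learning then interference turns, then score the first attempt."""
    history: History = []
    for turn in ep.learning_turns + ep.interference_turns:
        history.append(("user", turn))
        history.append(("assistant", agent(history)))
    # First-attempt scoring: only the initial response to the probe counts;
    # no retries, hints, or explicit reminders are allowed.
    history.append(("user", ep.test_prompt))
    return ep.is_correct(agent(history))

def score(agent: Callable[[History], str], episodes: List[Episode]) -> float:
    """Fraction of episodes whose first test response is judged correct."""
    return sum(run_episode(agent, ep) for ep in episodes) / len(episodes)
```

The key design point mirrored here is that the test prompt never references the learning phase: a model succeeds only if the earlier experience biases its first response automatically.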