

CL-bench: A Benchmark for Context Learning

February 3, 2026
作者: Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao
cs.AI

Abstract

Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge, beyond what is acquired during pre-training, to reason about and resolve tasks. We term this capability context learning: a crucial ability that humans naturally possess but that has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context: the material ranges from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks, which primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluation of ten frontier LMs finds that models solve only 17.2% of tasks on average; even the best-performing model, GPT-5.1, solves only 23.7%. This reveals that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex, context-dependent tasks. CL-bench represents a step toward building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
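To make the context/task/rubric structure concrete, below is a minimal Python sketch of how a rubric-based evaluation over such a benchmark could be organized. It is an illustration only: the abstract does not specify CL-bench's actual schema or scoring rule, so all names here (Rubric, Task, Context, task_solved, solve_rate) and the all-rubrics-must-pass criterion are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data layout for a CL-bench-style benchmark. The real
# schema and scoring procedure are not described in the abstract.

@dataclass
class Rubric:
    description: str                 # one verification criterion
    check: Callable[[str], bool]     # judge: model response -> pass/fail

@dataclass
class Task:
    prompt: str
    rubrics: List[Rubric]            # the 31,607 rubrics span the 1,899 tasks

@dataclass
class Context:
    document: str                    # holds the new knowledge its tasks require
    tasks: List[Task]                # 500 contexts group the tasks

def task_solved(response: str, task: Task) -> bool:
    """Assumed rule: a task counts as solved only if every rubric passes.
    The paper may instead award partial credit per rubric."""
    return all(rubric.check(response) for rubric in task.rubrics)

def solve_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks solved, as a percentage (e.g., the reported 17.2%)."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under this all-rubrics-pass reading, GPT-5.1's reported 23.7% would correspond to roughly 450 of the 1,899 tasks being solved; a partial-credit scheme would yield a different task count for the same headline number.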