CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Abstract

Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their ability for creative problem-solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs. To this end, we build a large-scale affordance knowledge base (KB) with 4K entities and 150K+ affordance annotations, explicitly linking objects, parts, attributes, and actionable uses. Building on this KB, we generate 14K grounded tasks that require identifying non-obvious yet physically plausible solutions under constraints. Evaluations across 10 state-of-the-art LLMs, including closed and open-source models, show that models can often select a plausible object, but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a significant drop in performance. Furthermore, improvements from model scaling quickly saturate, strong general reasoning does not reliably translate to creative affordance discovery, and common inference-time strategies such as Chain-of-Thought yield limited gains. These results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence, with potential implications for planning and reasoning modules in future agents.
