CREATE: Testing LLMs for Associative Creativity

March 10, 2026
Authors: Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman, Junyi Jessy Li, Greg Durrett
cs.AI

Abstract

A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.
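
The scoring idea described in the abstract (reward a larger set of paths that are both specific and mutually diverse) can be illustrated with a toy sketch. This is not the paper's metric: the function names, the Jaccard-based similarity, the thresholds, and the greedy selection are all assumptions made here for illustration, and the specificity scores are assumed to come from some external grader.

```python
from typing import Callable, List, Sequence, Tuple

# A path is a sequence of concepts from a start concept to an end concept.
Path = Tuple[str, ...]


def jaccard_similarity(a: Path, b: Path) -> float:
    """Overlap of the intermediate concepts of two paths (endpoints excluded).

    Hypothetical stand-in for a diversity measure; the benchmark may use
    something quite different.
    """
    sa, sb = set(a[1:-1]), set(b[1:-1])
    if not sa and not sb:
        return 1.0  # two direct connections are treated as identical
    return len(sa & sb) / len(sa | sb)


def toy_utility(
    paths: Sequence[Path],
    specificity: Callable[[Path], float],
    min_specificity: float = 0.5,   # assumed threshold, not from the paper
    max_similarity: float = 0.5,    # assumed threshold, not from the paper
) -> float:
    """Sum the specificity of paths that are strong and not near-duplicates.

    Paths are visited from most to least specific; a path is counted only if
    it clears the specificity threshold and is dissimilar enough from every
    path already counted. More strong, diverse paths give a higher score.
    """
    kept: List[Path] = []
    total = 0.0
    for path in sorted(paths, key=specificity, reverse=True):
        score = specificity(path)
        if score < min_specificity:
            continue  # weak connection: contributes nothing
        if any(jaccard_similarity(path, k) > max_similarity for k in kept):
            continue  # too similar to a counted path: no diversity credit
        kept.append(path)
        total += score
    return total
```

Under these assumptions, repeating the same path adds nothing, and a weakly specific path is ignored, so the score grows only with additional strong and distinct connections.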