Testing LLMs for Associative Creativity

Abstract

A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in their parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models score higher if they produce a larger set of strong, diverse paths. This task shares the demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, but the high multiplicity of answers and the complexity of the search make benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting yield only limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.