CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
March 12, 2026
Authors: Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang
cs.AI
Abstract
The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. The benchmark comprises two subsets, CreativeBench-Combo and CreativeBench-Explore, which target combinatorial and exploratory creativity, respectively, and are built by an automated pipeline that uses reverse engineering and self-play. Because code is executable, CreativeBench can objectively distinguish creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals three distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
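To illustrate why a product of quality and novelty separates creativity from hallucination, here is a minimal Python sketch. The abstract states only that the unified metric is the product of the two terms; how each term is computed is not given here. The function name `creativity_score`, the [0, 1] range of both inputs, and the example values are assumptions made for illustration, not the paper's definitions.

```python
def creativity_score(quality: float, novelty: float) -> float:
    """Return quality * novelty.

    Assumption: both inputs are already normalized to [0, 1], e.g. quality
    as a fraction of tests passed and novelty as some distance from a
    reference solution. The abstract does not specify either measure.
    """
    for name, value in (("quality", quality), ("novelty", novelty)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return quality * novelty


# Hypothetical cases (values invented for illustration only):
hallucination = creativity_score(quality=0.0, novelty=0.9)  # novel but fails when executed
conservative = creativity_score(quality=1.0, novelty=0.1)   # correct but nearly a copy
creative = creativity_score(quality=0.8, novelty=0.5)       # mostly correct and divergent
```

Under these assumptions, an output that is highly novel but fails execution scores zero, and a correct but unoriginal output scores low, so only outputs that are both correct and divergent rank highly.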