CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

March 12, 2026
Authors: Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang
cs.AI

Abstract

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. The benchmark comprises two subsets, CreativeBench-Combo and CreativeBench-Explore, which target combinatorial and exploratory creativity, respectively, through an automated pipeline that combines reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals three distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
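
To illustrate why a product of quality and novelty can separate creativity from hallucination, here is a minimal Python sketch. It assumes both scores are normalized to [0, 1]; the abstract does not say how either score is computed, so the function name `creativity_score`, the value range, and the example candidates are all hypothetical and not taken from the paper.

```python
def creativity_score(quality: float, novelty: float) -> float:
    """Return quality * novelty.

    Assumption (not from the paper): both inputs are normalized to [0, 1],
    e.g. quality as a fraction of passed checks and novelty as a
    dissimilarity to a reference solution.
    """
    for name, value in (("quality", quality), ("novelty", novelty)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return quality * novelty


# Hypothetical candidates as (quality, novelty) pairs, for illustration only.
candidates = {
    "correct_but_copied": (1.0, 0.0),  # works, but nothing new
    "novel_but_broken": (0.0, 1.0),    # new, but does not run correctly
    "correct_and_novel": (0.75, 0.5),  # partly correct and partly new
}

scores = {name: creativity_score(q, n) for name, (q, n) in candidates.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

Under this assumption, a candidate that is novel but incorrect scores zero, as does one that is correct but unoriginal; only output that is both executable and new receives credit, which is the property the abstract attributes to the unified metric.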