CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

Abstract

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. The benchmark comprises two subsets, CreativeBench-Combo and CreativeBench-Explore, which target combinatorial and exploratory creativity through an automated pipeline that uses reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals three distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
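The unified metric mentioned in the abstract, the product of quality and novelty, can be illustrated with a minimal sketch. The function name `creativity_score`, the assumption that both terms lie in [0, 1], and the example values are hypothetical and chosen for illustration only; the abstract does not say how CreativeBench actually computes quality or novelty.

```python
def creativity_score(quality: float, novelty: float) -> float:
    """Return quality * novelty (both assumed, for this sketch, to lie in [0, 1])."""
    for name, value in (("quality", quality), ("novelty", novelty)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return quality * novelty


# Hypothetical example values, not benchmark results.
# A novel but non-working solution (a "hallucination") scores zero.
hallucinated = creativity_score(quality=0.0, novelty=0.9)
# A correct but entirely unoriginal solution also scores zero.
copied = creativity_score(quality=1.0, novelty=0.0)
# Only a solution that is both correct and novel gets a positive score.
creative = creativity_score(quality=1.0, novelty=0.5)
```

Because the two factors are multiplied rather than added, a zero in either one drives the score to zero, which is one way a product-style metric can separate creative output from hallucinated output.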
