The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
January 2, 2026
Authors: Max Ruiz Luyten, Mihaela van der Schaar
cs.AI
Abstract
State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops: sampling diverse chains of thought and reinforcing the highest-scoring ones, optimizing mainly for correctness. We analyze how this design choice drives the collapse of the model's distribution over reasoning paths, slashing semantic entropy and undermining creative problem-solving. To dissect this failure, we introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces. STaR, GRPO, and DPO, as well as entropy bonuses and other methods, all constitute special cases of the same loss. The framework delivers three core results: (i) a diversity decay theorem describing how correctness-based objectives produce distinct modes of diversity decay in STaR, GRPO, and DPO; (ii) designs that guarantee convergence to a stable, diverse policy, preventing collapse; and (iii) simple, actionable recipes for achieving this in practice. DCR thus offers the first principled recipe for training LLMs that remain both correct and creative.
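To make the collapse mechanism concrete, the sketch below is a minimal toy simulation, our illustration under stated assumptions rather than code from the paper: a softmax policy over K = 5 abstract solution modes, each with an assumed per-mode success rate `p_correct`, is trained by REINFORCE on a correctness reward, with and without an entropy bonus. All constants (`K`, `p_correct`, the learning rate) are hypothetical.

```python
# Toy simulation (illustrative, not from the paper): a REINFORCE-style
# correctness objective collapses the policy's entropy over solution modes,
# while an entropy bonus preserves diversity.
import numpy as np

rng = np.random.default_rng(0)

K = 5                                             # hypothetical number of solution modes
p_correct = np.array([0.9, 0.8, 0.8, 0.7, 0.6])   # assumed per-mode success rates
lr, steps, batch = 0.5, 3000, 64                  # assumed hyperparameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

def train(entropy_coef):
    """Optimize mode logits by REINFORCE on correctness, plus an entropy bonus."""
    logits = np.zeros(K)
    for _ in range(steps):
        pi = softmax(logits)
        modes = rng.choice(K, size=batch, p=pi)                   # sample reasoning paths
        reward = (rng.random(batch) < p_correct[modes]).astype(float)  # correctness signal
        # REINFORCE estimate of d E[reward] / d logits:
        # average of reward * (onehot(mode) - pi) over the batch
        grad = (np.bincount(modes, weights=reward, minlength=K) / batch
                - reward.mean() * pi)
        # Exact gradient of the policy entropy H(pi) w.r.t. the logits
        ent_grad = -pi * (np.log(pi + 1e-12) + entropy(pi))
        logits += lr * (grad + entropy_coef * ent_grad)
    return entropy(softmax(logits))

for coef in (0.0, 0.1):
    print(f"entropy_coef={coef}: final policy entropy = {train(coef):.3f} "
          f"(max = {np.log(K):.3f})")
```

With `entropy_coef = 0.0` the policy concentrates on the single highest-success mode and its entropy drops toward zero, mirroring the diversity decay described above; with a modest bonus (`0.1`) it settles on a stable, diverse distribution, consistent with the abstract's claim that entropy-regularized designs prevent collapse.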