Rectifying LLM Thought from Lens of Optimization

Abstract

Recent advances in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively long reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure in which each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refining LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, and uses a dual scoring mechanism to quantify that process's intensity and stability. These scores are aggregated into a composite process-level reward, which is integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
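The abstract does not give the formulas behind the dual scoring mechanism, so the following is only a toy sketch of the general idea, not the RePro method itself. It assumes a hypothetical list of surrogate-objective values, one per reasoning step, where lower is better. The definitions of `intensity_score`, `stability_score`, and the `weight` parameter are illustrative assumptions.

```python
from typing import Sequence


def intensity_score(objective: Sequence[float]) -> float:
    """Assumed definition: net decrease of the surrogate objective
    over the whole chain, relative to its starting value."""
    if len(objective) < 2 or objective[0] == 0:
        return 0.0
    return (objective[0] - objective[-1]) / abs(objective[0])


def stability_score(objective: Sequence[float]) -> float:
    """Assumed definition: fraction of reasoning steps that do not
    increase the surrogate objective (no backsliding)."""
    steps = list(zip(objective, objective[1:]))
    if not steps:
        return 1.0
    return sum(1 for prev, cur in steps if cur <= prev) / len(steps)


def process_reward(objective: Sequence[float], weight: float = 0.5) -> float:
    """Hypothetical composite reward: a convex combination of the two scores."""
    return weight * intensity_score(objective) + (1.0 - weight) * stability_score(objective)


# A chain that improves steadily versus one that oscillates before converging.
steady = [4.0, 2.0, 1.0, 0.0]
oscillating = [4.0, 3.0, 3.5, 1.0]
```

Under these assumed definitions, the steady chain scores higher than the oscillating one, which mirrors the abstract's goal of rewarding strong, stable progress and penalizing overthinking. In an RLVR setting, such a process-level term would presumably be combined with the verifiable outcome reward; how RePro does this is not stated here.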

