InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

Abstract

Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
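The data-construction step described above (keep the on-policy rollout up to the first error, then append the self-proposed intervention) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `SFTExample` and `build_int_example`, the representation of a rollout as a list of step strings, and the choice to drop the erroneous step itself are all assumptions made for the example. The first-error index and the intervention text are taken as given inputs, since the abstract does not specify how the model is prompted to produce them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SFTExample:
    prompt: str  # problem statement, used as context
    target: str  # on-policy prefix followed by the intervention


def build_int_example(
    problem: str,
    steps: List[str],
    first_error_idx: int,
    intervention: str,
) -> SFTExample:
    """Build one SFT example from a failed rollout (hypothetical helper).

    Assumption: `steps[first_error_idx]` is the first erroneous step, and it
    is replaced by the intervention instead of being kept in the target.
    """
    if not 0 <= first_error_idx < len(steps):
        raise ValueError("first_error_idx must index a step in the rollout")
    prefix = steps[:first_error_idx]  # steps before the first error
    return SFTExample(prompt=problem, target="\n".join(prefix + [intervention]))


# Toy usage: the third step contains an arithmetic slip.
example = build_int_example(
    problem="Compute 12 * 13.",
    steps=[
        "12 * 13 = 12 * 10 + 12 * 3",
        "12 * 10 = 120",
        "12 * 3 = 46",
        "So the answer is 166.",
    ],
    first_error_idx=2,
    intervention="Wait, 12 * 3 = 36, so the sum is 120 + 36.",
)
print(example.target)
```

Everything after the first error is discarded, so the training signal is concentrated on the point where the trajectory went wrong instead of on the whole failed trace.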
