How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_Q that interpolates between RLVR (at q=0, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q=0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
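
The amplification mechanism can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a toy model in which the success probability is a sigmoid of a single logit, an escape threshold of 0.5, and forward-Euler integration of the gradient flow. The names tsallis_log, amplification, and escape_time are hypothetical helpers introduced here. Under these assumptions the logit evolves as ds/dt = p^{-q} * p(1-p), so q=0 gives dp/dt of order p^2 near zero (escape time growing like 1/p_0), while q=1 gives dp/dt of order p (escape time growing like log(1/p_0)).

```python
import math


def tsallis_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x^(1-q) - 1) / (1 - q), with ln(x) as the q -> 1 limit."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Derivative of log_q(p) with respect to p, i.e. the scalar factor p^(-q)."""
    return p ** (-q)


def escape_time(p0: float, q: float, target: float = 0.5, dt: float = 1e-2) -> float:
    """Time for the toy gradient flow to lift p from p0 to `target`.

    Toy assumption (not from the paper): p = sigmoid(s) for one logit s,
    and s follows ds/dt = d/ds log_q(p) = p^(-q) * p * (1 - p).
    Integrated with forward Euler using step size dt.
    """
    if not 0.0 < p0 < target < 1.0:
        raise ValueError("require 0 < p0 < target < 1")
    s = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    while True:
        p = 1.0 / (1.0 + math.exp(-s))
        if p >= target:
            return t
        s += dt * amplification(p, q) * p * (1.0 - p)
        t += dt


if __name__ == "__main__":
    # Compare escape times across the continuum for two cold-start levels.
    for p0 in (1e-2, 1e-3):
        for q in (0.0, 0.75, 1.0):
            print(f"p0={p0:g}  q={q:g}  escape_time={escape_time(p0, q):.2f}")
```

In this toy setting, shrinking p_0 by a factor of ten lengthens the q=0 escape time by nearly the same factor, whereas the q=1 escape time grows only by a small additive amount. The sketch says nothing about the noise-memorization side of the trade-off or about the GARL and PAFT estimators.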
