How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
April 28, 2026
Authors: Chu-Cheng Lin, Eugene Ie
cs.AI
Abstract
Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_q that interpolates between RLVR (at q = 0, the exploitation pole) and the log-marginal likelihood over latent trajectories (at q = 1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT retains semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q = 0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm-start settings, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q = 0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
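For readers unfamiliar with the Tsallis q-logarithm, the following sketch (our notation; the paper's definitions may differ in detail) shows why a single family can share one gradient direction across q, assuming the per-example loss is the negative q-logarithm of the success probability P_θ:

```latex
% Tsallis q-logarithm (standard definition) and its q -> 1 limit:
\[
  \ln_q(x) = \frac{x^{1-q} - 1}{1 - q},
  \qquad
  \lim_{q \to 1} \ln_q(x) = \ln x .
\]
% Assuming the per-example loss is J_q(\theta) = -\ln_q P_\theta,
% its gradient factors as
\[
  \nabla_\theta \bigl[ -\ln_q P_\theta \bigr]
  = -\, P_\theta^{-q} \, \nabla_\theta P_\theta
  = -\, P_\theta^{1-q} \, \nabla_\theta \log P_\theta ,
\]
% so every q shares the direction \nabla_\theta P_\theta and differs only by
% the scalar amplification P_\theta^{-q}. At q = 0 the loss is 1 - P_\theta
% (expected reward, the RLVR/exploitation pole); at q = 1 it is
% -\log P_\theta (the density-estimation pole).
```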
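The quoted escape rates can also be recovered in a one-parameter caricature. Assume a single logit θ with success probability p = σ(θ), so dp/dθ = p(1 − p) ≈ p for small p; this toy model is ours, not the paper's analysis:

```latex
% Gradient flow on the toy loss -\ln_q p with p = \sigma(\theta):
\[
  \dot{\theta} = -\,\frac{d}{d\theta}\bigl[ -\ln_q p \bigr]
               = p^{-q}\,\frac{dp}{d\theta}
  \quad\Longrightarrow\quad
  \dot{p} = \frac{dp}{d\theta}\,\dot{\theta} \approx p^{2-q} .
\]
% Integrating from p_0 up to a constant threshold: for q = 0,
% \dot{p} = p^2 gives an escape time of order 1/p_0; for q = 1,
% \dot{p} = p gives order \log(1/p_0) -- matching the \Omega(1/p_0)
% and \Theta(\log(1/p_0)) rates quoted in the abstract.
```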
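On the algorithmic side, here is a minimal sketch of a GARL-style update, assuming the amplification is applied as a detached scalar on the usual RLVR surrogate; the names (garl_loss, q, eps) and the clamping guard are our illustrative choices, not the authors' implementation:

```python
# Illustrative GARL-style loss (a sketch, not the authors' code): sample M
# rollouts from the current policy (the "prior"), estimate P_theta by the
# verified success rate, and amplify the RLVR policy-gradient loss by
# P_theta^{-q}, detached so it rescales the gradient without entering it.
import torch

def garl_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
              q: float = 0.75, eps: float = 1e-3) -> torch.Tensor:
    """logprobs: (M,) summed token log-probs of M sampled rollouts.
    rewards:  (M,) verifiable 0/1 rewards for those rollouts."""
    # Monte Carlo estimate of P_theta = Pr(verified-correct rollout); the
    # clamp is a practical guard against p_hat = 0, not a principled choice.
    p_hat = rewards.float().mean().clamp_min(eps)
    # Standard RLVR / REINFORCE surrogate: -E[ r * log pi(y|x) ].
    rl_loss = -(rewards.float() * logprobs).mean()
    # Scalar amplification P_theta^{-q}; at q = 0 this reduces to plain RLVR.
    return p_hat.detach() ** (-q) * rl_loss
```

PAFT, per the abstract, would instead importance-resample successful rollouts from the posterior and run a standard SFT (cross-entropy) step on them; we omit that variant here.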