
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

April 28, 2026
Authors: Chu-Cheng Lin, Eugene Ie
cs.AI

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_Q that interpolates between RLVR (at q=0, the exploitation pole) and the log-marginal likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q=0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
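The interpolation the abstract describes can be sketched numerically: the Tsallis q-logarithm ln_q(x) = (x^(1-q) - 1)/(1-q) recovers ln(x) as q → 1 and x - 1 at q = 0, and the per-example loss -ln_q(P_θ) has derivative -P_θ^{-q}, which is exactly the scalar amplification the abstract names. A minimal illustration (function names and the scalar stand-in for P_θ are ours, not the paper's):

```python
import math

def tsallis_q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x^(1-q) - 1)/(1-q); recovers ln(x) in the limit q -> 1."""
    if abs(1.0 - q) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def loss_J(p: float, q: float) -> float:
    """Per-example loss -ln_q(p), with p standing in for the success probability P_theta."""
    return -tsallis_q_log(p, q)

def amplification(p: float, q: float) -> float:
    """Scalar gradient amplification p^{-q}: the derivative of loss_J w.r.t. p is -p^{-q}."""
    return p ** (-q)
```

At q = 0 the amplification is 1 for every example (the plain RL-style gradient), at q = 1 it is 1/p (the log-likelihood gradient, which blows up as p_0 → 0 and drives the fast Θ(log(1/p_0)) escape), and intermediate q, such as the q = 0.75 used in the experiments, amplifies rare successes less aggressively, trading escape speed against noise memorization.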