How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
April 28, 2026
Authors: Chu-Cheng Lin, Eugene Ie
cs.AI
Abstract
Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_q that interpolates between RLVR (at q = 0, the exploitation pole) and the log-marginal likelihood over latent trajectories (at q = 1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT retains semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q = 0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm-start settings, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q = 0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
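For readers unfamiliar with the Tsallis q-logarithm, the following sketch (our notation; the paper's definitions may differ in detail) shows why a single family can share one gradient direction across q, assuming the per-example loss is the negative q-logarithm of the success probability P_θ:

```latex
% Tsallis q-logarithm (standard definition) and its q -> 1 limit:
\[
  \ln_q(x) = \frac{x^{1-q} - 1}{1 - q},
  \qquad
  \lim_{q \to 1} \ln_q(x) = \ln x .
\]
% Assuming the per-example loss is J_q(\theta) = -\ln_q P_\theta,
% its gradient factors as
\[
  \nabla_\theta \bigl[ -\ln_q P_\theta \bigr]
  = -\, P_\theta^{-q} \, \nabla_\theta P_\theta
  = -\, P_\theta^{1-q} \, \nabla_\theta \log P_\theta ,
\]
% so every q shares the direction \nabla_\theta P_\theta and differs only by
% the scalar amplification P_\theta^{-q}. At q = 0 the loss is 1 - P_\theta
% (expected reward, the RLVR/exploitation pole); at q = 1 it is
% -\log P_\theta (the density-estimation pole).
```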
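The quoted escape rates can also be recovered in a one-parameter caricature. Assume a single logit θ with success probability p = σ(θ), so dp/dθ = p(1 − p) ≈ p for small p; this toy model is ours, not the paper's analysis:

```latex
% Gradient flow on the toy loss -\ln_q p with p = \sigma(\theta):
\[
  \dot{\theta} = -\,\frac{d}{d\theta}\bigl[ -\ln_q p \bigr]
               = p^{-q}\,\frac{dp}{d\theta}
  \quad\Longrightarrow\quad
  \dot{p} = \frac{dp}{d\theta}\,\dot{\theta} \approx p^{2-q} .
\]
% Integrating from p_0 up to a constant threshold: for q = 0,
% \dot{p} = p^2 gives an escape time of order 1/p_0; for q = 1,
% \dot{p} = p gives order \log(1/p_0) -- matching the \Omega(1/p_0)
% and \Theta(\log(1/p_0)) rates quoted in the abstract.
```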
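On the algorithmic side, here is a minimal sketch of a GARL-style update, assuming the amplification is applied as a detached scalar on the usual RLVR surrogate; the names (garl_loss, q, eps) and the clamping guard are our illustrative choices, not the authors' implementation:

```python
# Illustrative GARL-style loss (a sketch, not the authors' code): sample M
# rollouts from the current policy (the "prior"), estimate P_theta by the
# verified success rate, and amplify the RLVR policy-gradient loss by
# P_theta^{-q}, detached so it rescales the gradient without entering it.
import torch

def garl_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
              q: float = 0.75, eps: float = 1e-3) -> torch.Tensor:
    """logprobs: (M,) summed token log-probs of M sampled rollouts.
    rewards:  (M,) verifiable 0/1 rewards for those rollouts."""
    # Monte Carlo estimate of P_theta = Pr(verified-correct rollout); the
    # clamp is a practical guard against p_hat = 0, not a principled choice.
    p_hat = rewards.float().mean().clamp_min(eps)
    # Standard RLVR / REINFORCE surrogate: -E[ r * log pi(y|x) ].
    rl_loss = -(rewards.float() * logprobs).mean()
    # Scalar amplification P_theta^{-q}; at q = 0 this reduces to plain RLVR.
    return p_hat.detach() ** (-q) * rl_loss
```

PAFT, per the abstract, would instead importance-resample successful rollouts from the posterior and run a standard SFT (cross-entropy) step on them; we omit that variant here.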