
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

April 28, 2026
Authors: Chu-Cheng Lin, Eugene Ie
cs.AI

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_Q that interpolates between RLVR (at q=0, the exploitation pole) and the log-marginal likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q=0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
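The interpolation the abstract describes can be sketched numerically: the Tsallis q-logarithm ln_q(x) = (x^(1-q) - 1)/(1-q) recovers ln(x) as q → 1 and x - 1 at q = 0, and the per-example loss -ln_q(P_θ) has derivative -P_θ^{-q}, which is exactly the scalar amplification the abstract names. A minimal illustration (function names and the scalar stand-in for P_θ are ours, not the paper's):

```python
import math

def tsallis_q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x^(1-q) - 1)/(1-q); recovers ln(x) in the limit q -> 1."""
    if abs(1.0 - q) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def loss_J(p: float, q: float) -> float:
    """Per-example loss -ln_q(p), with p standing in for the success probability P_theta."""
    return -tsallis_q_log(p, q)

def amplification(p: float, q: float) -> float:
    """Scalar gradient amplification p^{-q}: the derivative of loss_J w.r.t. p is -p^{-q}."""
    return p ** (-q)
```

At q = 0 the amplification is 1 for every example (the plain RL-style gradient), at q = 1 it is 1/p (the log-likelihood gradient, which blows up as p_0 → 0 and drives the fast Θ(log(1/p_0)) escape), and intermediate q, such as the q = 0.75 used in the experiments, amplifies rare successes less aggressively, trading escape speed against noise memorization.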