How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_Q that interpolates between RLVR (at q=0, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)); intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q=0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
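As a rough illustration of the amplification mentioned in the abstract, the sketch below evaluates the standard Tsallis q-logarithm, log_q(p) = (p^{1-q} - 1)/(1-q), and its derivative p^{-q}. It assumes the per-example objective is log_q applied to the success probability, which is consistent with the two poles described above but is not spelled out in the abstract; the function names `tsallis_log` and `amplification` are hypothetical.

```python
import math


def tsallis_log(p: float, q: float) -> float:
    """Tsallis q-logarithm: (p**(1-q) - 1) / (1-q), with the natural log at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Derivative of tsallis_log with respect to p, which equals p**(-q)."""
    return p ** (-q)


if __name__ == "__main__":
    p0 = 1e-3  # assumed small initial success probability (illustrative value)
    for q in (0.0, 0.5, 0.75, 1.0):
        print(f"q={q:.2f}  log_q(p0)={tsallis_log(p0, q):.4f}  "
              f"gain={amplification(p0, q):.1f}")
```

Under this assumption, q=0 gives log_0(p) = p - 1 with unit gain (the plain expected-reward gradient), while q=1 gives log p with gain 1/p, so the rescaling grows as the success probability shrinks.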
