How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability p_0 is small. Using the Tsallis q-logarithm, we define a loss family J_Q that interpolates between RLVR (at q=0, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at q=1, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires Ω(1/p_0) time to escape cold start, while the density-estimation pole escapes in Θ(log(1/p_0)) time; intermediate q trades escape speed against noise memorization. Because P_θ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias O(q/(M P_θ^{q+1})); GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at q=0.75 substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low q dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at q=0.75 provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
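
The amplification mechanism can be illustrated with a small numerical sketch. It relies on the standard definition of the Tsallis q-logarithm, ln_q(x) = (x^{1-q} - 1)/(1-q), whose derivative is x^{-q}. The toy model below is an illustrative assumption and not the setup described in the abstract: a single parameter theta with success probability p = sigmoid(theta), and the helper names `q_log`, `amplification`, and `escape_time` are hypothetical. Under that assumption, gradient flow on ln_q(p) gives dp/dt proportional to p^{2-q} for small p, so the escape time scales like 1/p_0 at q=0 and like log(1/p_0) at q=1.

```python
import math


def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x**(1-q) - 1) / (1-q), reducing to ln(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Derivative of q_log at p, i.e. the scalar p**(-q) multiplying grad P_theta."""
    return p ** (-q)


def escape_time(p0: float, q: float, target: float = 0.5, dt: float = 1e-2) -> float:
    """Euler-integrate the toy gradient flow until p reaches `target`.

    Toy assumption (not from the paper): p = sigmoid(theta), so
    d(theta)/dt = p**(-q) * dp/dtheta = p**(-q) * p * (1 - p).
    """
    theta = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    while True:
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return t
        theta += dt * amplification(p, q) * p * (1.0 - p)
        t += dt


if __name__ == "__main__":
    for q in (0.0, 0.75, 1.0):
        times = [escape_time(p0, q) for p0 in (1e-2, 1e-3)]
        print(f"q={q}: escape times for p0=1e-2 and p0=1e-3 -> {times}")
```

In this toy, shrinking p_0 by a factor of ten lengthens the q=0 escape time by roughly a factor of ten, while the q=1 escape time grows only by an additive amount; q=0.75 falls in between. The sketch says nothing about noise memorization or about the GARL and PAFT estimators, which require sampling and are not modeled here.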
