

Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

May 19, 2025
Authors: Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross
cs.AI

Abstract

Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chains of Thought (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-stage approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR-trained on the same small dataset (≤100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
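
Both stages of the pipeline rest on rule-checkable answers: a K&K puzzle has an assignment of knights and knaves that can be verified by brute force, and RLVR needs a programmatic reward that scores each sampled completion against the ground truth. The sketch below is a minimal illustration of those two checks, not the authors' implementation; the function names, the \boxed{} answer convention, and the toy two-person puzzle are assumptions made for this example.

```python
# Minimal sketch (assumed, not from the paper) of the two verifiable checks
# underlying the pipeline: a brute-force K&K solver (warmup toy domain) and a
# binary verifiable reward for RLVR.
from itertools import product
import re


def solve_knights_and_knaves(names, statements):
    """Brute-force a K&K puzzle: each person is a knight (truth-teller) or knave (liar).

    `statements` maps a speaker to a function that, given a full assignment
    (dict name -> bool, True = knight), says whether the statement is true.
    A statement must be true iff its speaker is a knight.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(statements[p](assignment) == assignment[p] for p in names):
            solutions.append(assignment)
    return solutions


def verifiable_reward(completion, ground_truth):
    """Binary RLVR-style reward: 1.0 if the extracted final answer matches exactly."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match:
        predicted = match.group(1).strip()
    else:
        lines = completion.strip().splitlines()
        predicted = lines[-1].strip() if lines else ""
    return 1.0 if predicted == ground_truth.strip() else 0.0


if __name__ == "__main__":
    # Two-person puzzle: A says "B is a knave"; B says "we are both knights".
    names = ["A", "B"]
    statements = {
        "A": lambda a: not a["B"],
        "B": lambda a: a["A"] and a["B"],
    }
    print(solve_knights_and_knaves(names, statements))  # [{'A': True, 'B': False}]

    print(verifiable_reward("... so the answer is \\boxed{42}", "42"))  # 1.0
```

In a full pipeline, a function like `verifiable_reward` would serve as the rollout-level reward inside whatever RLVR trainer (e.g., a PPO- or GRPO-style implementation) is used in stage two, while a solver like `solve_knights_and_knaves` would be used to generate or validate the K&K distillation data for the warmup stage.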
