

Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

May 19, 2025
作者: Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross
cs.AI

Abstract

Designing effective reasoning-capable LLMs typically requires training with Reinforcement Learning with Verifiable Rewards (RLVR) or distillation on carefully curated Long Chains of Thought (CoT), both of which depend heavily on extensive training data. This creates a major challenge when high-quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-stage approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR-trained on the same small dataset (≤100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup into the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
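The abstract describes the pipeline only at a high level; the minimal sketch below shows one way the two stages could be wired together. Every name here (StubPolicy, warmup_distillation, rlvr_finetune, verify_answer) and the exact-match reward are hypothetical placeholders chosen for illustration, not the authors' released code, models, or hyperparameters.

```python
import random
import re


class StubPolicy:
    """Placeholder for an LLM policy; a real run would wrap an actual model
    and optimizer. It exists only so the sketch executes end to end."""

    def generate(self, prompt: str) -> str:
        # A real policy would sample a Long-CoT completion from the model.
        return "... reasoning steps ... Answer: 42"

    def sft_step(self, prompt: str, target: str) -> None:
        pass  # supervised update toward the distilled Long CoT trace

    def rl_step(self, prompt: str, completion: str, reward: float) -> None:
        pass  # policy-gradient update (e.g. PPO/GRPO in practice)


def verify_answer(completion: str, reference: str) -> float:
    """Verifiable reward: 1.0 iff the extracted final answer matches the reference."""
    match = re.search(r"Answer:\s*([^\n]+)", completion)
    return 1.0 if match and match.group(1).strip() == reference else 0.0


def warmup_distillation(policy, kk_traces):
    """Stage 1: distill Long CoT traces from Knights & Knaves puzzles
    to instill general reasoning behaviour before any RL."""
    for prompt, long_cot in kk_traces:
        policy.sft_step(prompt, long_cot)
    return policy


def rlvr_finetune(policy, target_examples, steps=500):
    """Stage 2: RLVR on a small (<=100 example) target-domain set,
    rewarding only verifiably correct answers."""
    for _ in range(steps):
        prompt, reference = random.choice(target_examples)
        completion = policy.generate(prompt)
        reward = verify_answer(completion, reference)
        policy.rl_step(prompt, completion, reward)
    return policy


if __name__ == "__main__":
    policy = StubPolicy()
    policy = warmup_distillation(policy, [("K&K puzzle ...", "<long CoT trace>")])
    policy = rlvr_finetune(policy, [("target-domain problem ...", "42")])
```

The key design point the sketch tries to surface is that the two stages are decoupled: the warmup data (K&K traces) never has to come from the target domain, while the RLVR stage needs only a small verifiable target-domain set.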

