Tina: Tiny Reasoning Models via LoRA
April 22, 2025
作者: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger
cs.AI
Abstract
How cost-effectively can strong reasoning abilities be achieved in language
models? Driven by this fundamental question, we present Tina, a family of tiny
reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates
that substantial reasoning performance can be developed using only minimal
resources, by applying parameter-efficient updates during reinforcement
learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B
parameter base model. This minimalist approach produces models that achieve
reasoning performance which is competitive with, and sometimes surpasses, SOTA
RL reasoning models built upon the same base model. Crucially, this is achieved
at a tiny fraction of the computational post-training cost employed by existing
SOTA models. In fact, the best Tina model achieves a >20% reasoning
performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD
post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our
work reveals the surprising effectiveness of efficient RL reasoning via LoRA.
We validate this across multiple open-source reasoning datasets and various
ablation settings starting with a single, fixed set of hyperparameters.
Furthermore, we hypothesize that this effectiveness and efficiency stem from
LoRA rapidly adapting the model to the structural format of reasoning rewarded
by RL, while largely preserving the base model's underlying knowledge. In
service of accessibility and open research, we fully open-source all code,
training logs, and model weights & checkpoints.
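
To make the parameter-efficient setup concrete, here is a minimal sketch of attaching LoRA adapters to a ~1.5B-parameter base model before RL post-training, using Hugging Face transformers and peft. The checkpoint name, rank, scaling factor, and target modules below are illustrative assumptions rather than the authors' exact configuration; consult the open-sourced code for the actual settings.

```python
# Minimal sketch: attach LoRA adapters to a small base model before RL post-training.
# Checkpoint name and LoRA hyperparameters are illustrative assumptions,
# not the authors' released configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed ~1.5B-parameter base model; substitute the actual checkpoint used.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

lora_cfg = LoraConfig(
    r=16,                 # low-rank dimension (illustrative)
    lora_alpha=32,        # LoRA scaling factor (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable

# The wrapped `model` can then stand in for the full model inside an RL trainer
# (e.g., TRL's GRPOTrainer), so the RL updates touch only the LoRA parameters.
```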