Tina: Tiny Reasoning Models via LoRA

Abstract

How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this question, we present Tina, a family of tiny reasoning models trained at very low cost. Tina shows that substantial reasoning performance can be developed with minimal resources by applying parameter-efficient updates, via low-rank adaptation (LoRA), during reinforcement learning (RL) to an already tiny 1.5B-parameter base model. This minimalist approach yields models whose reasoning performance is competitive with, and sometimes surpasses, that of SOTA RL reasoning models built on the same base model, at a tiny fraction of the computational post-training cost those models incur. The best Tina model achieves a more than 20% increase in reasoning performance and 43.33% Pass@1 accuracy on AIME24, at only $9 USD in post-training and evaluation cost (an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings, starting from a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights and checkpoints.
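To make the parameter-efficiency argument concrete, the following is a minimal sketch of the standard LoRA update, in which a frozen weight matrix W is augmented by a trainable low-rank term (alpha / r) * B A. The layer size, rank, and scaling value below are hypothetical illustrations and are not taken from Tina's configuration; the helper names are likewise assumptions made for this sketch.

```python
# Minimal, dependency-free sketch of a LoRA update (illustrative only).
# All dimensions and hyperparameters here are hypothetical, not Tina's settings.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A, alpha, rank):
    """Return the low-rank weight update (alpha / rank) * B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in), so the
    update has the same shape as the frozen weight it is added to.
    """
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


def lora_param_counts(d_out, d_in, rank):
    """Return (frozen weight parameters, trainable adapter parameters)."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter


if __name__ == "__main__":
    d_out, d_in, rank = 1536, 1536, 16  # hypothetical layer size and rank
    full, adapter = lora_param_counts(d_out, d_in, rank)
    print(f"frozen: {full:,}  trainable: {adapter:,}  ratio: {adapter / full:.2%}")
```

Under these assumed dimensions, the adapter trains only a few percent of the parameters of the layer it modifies, and the frozen weight is never updated, which is consistent with the hypothesis that the base model's underlying knowledge is largely preserved.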
