Resa: Transparent Reasoning Models via SAEs

Abstract

How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process that elicits those abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 in additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means that abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means that abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings, and all artifacts are fully open-sourced.
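As a rough sanity check on the efficiency figures, the quoted reduction factors can be turned into implied lower bounds for the RL-trained counterpart. The short Python sketch below only does that arithmetic; the constant and function names are illustrative, the inputs are the approximate values quoted in the abstract, and the outputs are implied bounds rather than numbers reported by the authors.

```python
# Back-of-the-envelope check of the abstract's efficiency claims.
# Inputs are the approximate figures quoted in the abstract; the results
# are implied lower bounds, not values reported by the authors.

SAE_TUNING_COST_USD = 1.0    # "roughly $1"
SAE_TUNING_TIME_MIN = 20.0   # "around 20 minutes"
COST_REDUCTION = 2000.0      # ">2000x"
TIME_REDUCTION = 450.0       # ">450x"


def implied_rl_baseline(cost_usd, time_min, cost_factor, time_factor):
    """Return (cost in USD, time in hours) implied for the RL-trained counterpart."""
    baseline_cost_usd = cost_usd * cost_factor
    baseline_time_hours = time_min * time_factor / 60.0
    return baseline_cost_usd, baseline_time_hours


rl_cost_usd, rl_time_hours = implied_rl_baseline(
    SAE_TUNING_COST_USD, SAE_TUNING_TIME_MIN, COST_REDUCTION, TIME_REDUCTION
)

print(f"Implied RL training cost: more than about ${rl_cost_usd:,.0f}")
print(f"Implied RL training time: more than about {rl_time_hours:,.0f} hours")
```

Because the abstract states the factors as lower bounds and the SAE-Tuning figures as approximations, the results should be read as orders of magnitude: on the order of thousands of dollars and roughly 150 hours or more for the RL-trained baseline.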
