Resa: Transparent Reasoning Models via SAEs
June 11, 2025
Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger
cs.AI
Abstract
How cost-effectively can we elicit strong reasoning in language models by
leveraging their underlying representations? We answer this question with Resa,
a family of 1.5B reasoning models trained via a novel and efficient sparse
autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to
capture reasoning abilities from a source model, and then uses the trained SAE
to guide a standard supervised fine-tuning process to elicit such abilities in
a target model, all using verified question-answer data without any reasoning
traces. Notably, when applied to certain base models before further RL
post-training, SAE-Tuning retains >97% of its RL-trained counterpart's
reasoning performance while reducing training costs by >2000x to roughly $1
and training time by >450x to around 20 minutes. Furthermore, when applied to
lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning
performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only
around $1 in additional cost. Surprisingly, the reasoning abilities extracted via
SAEs are potentially both generalizable and modular. Generality means abilities
extracted from one dataset still elevate performance on a larger and
overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math
can be attached to the R1-Distill model at test time, without any retraining,
and yield comparable gains. Extensive ablations validate these findings and all
artifacts are fully open-sourced.
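To make the two-stage SAE-Tuning recipe described in the abstract concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's released implementation: the hidden-state layer, dictionary size, loss weights, and the exact way the frozen SAE "guides" supervised fine-tuning (here modeled as an auxiliary reconstruction term added to the language-modeling loss) are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE: over-complete dictionary trained with an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = F.relu(self.encoder(h))   # sparse feature activations
        return self.decoder(z), z     # reconstruction, latent code

d_model, d_dict = 1536, 8192          # illustrative sizes for a ~1.5B-parameter model
sae = SparseAutoencoder(d_model, d_dict)
sae_opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stage 1 (sketch): fit the SAE on hidden states captured from the *source* model,
# e.g. activations at one layer while it processes the verified question-answer data.
def sae_train_step(source_hidden: torch.Tensor) -> float:
    h = source_hidden.reshape(-1, d_model)
    h_hat, z = sae(h)
    loss = F.mse_loss(h_hat, h) + 1e-3 * z.abs().mean()  # reconstruction + sparsity
    sae_opt.zero_grad(); loss.backward(); sae_opt.step()
    return loss.item()

# Stage 2 (sketch): freeze the SAE and use it during standard SFT of the *target*
# model, adding a term that keeps the target's activations consistent with the
# SAE's learned feature basis (one simple form of "guidance"; the paper's exact
# objective may differ).
sae.requires_grad_(False)

def guided_sft_loss(lm_loss: torch.Tensor, target_hidden: torch.Tensor,
                    alpha: float = 0.1) -> torch.Tensor:
    h = target_hidden.reshape(-1, d_model)
    h_hat, _ = sae(h)
    return lm_loss + alpha * F.mse_loss(h_hat, h)  # auxiliary guidance term
```

In this sketch, Stage 1 only trains the SAE (the source model is used solely to produce activations), and Stage 2 only trains the target model, with gradients flowing through the frozen SAE into the target's hidden states; no reasoning traces appear anywhere in the data, matching the abstract's claim that SAE-Tuning uses verified question-answer pairs alone.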