Resa: Transparent Reasoning Models via SAEs
June 11, 2025
作者: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger
cs.AI
Abstract
How cost-effectively can we elicit strong reasoning in language models by
leveraging their underlying representations? We answer this question with Resa,
a family of 1.5B reasoning models trained via a novel and efficient sparse
autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to
capture reasoning abilities from a source model, and then uses the trained SAE
to guide a standard supervised fine-tuning process to elicit such abilities in
a target model, all using verified question-answer data without any reasoning
traces. Notably, when applied to certain base models before further RL
post-training, SAE-Tuning retains >97% of its RL-trained counterpart's
reasoning performance while reducing training costs by >2000x to roughly \$1
and training time by >450x to around 20 minutes. Furthermore, when applied to
lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning
performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only
around \$1 in additional cost. Surprisingly, the reasoning abilities extracted via
SAEs are potentially both generalizable and modular. Generality means abilities
extracted from one dataset still elevate performance on a larger and
overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math
can be attached to the R1-Distill model at test time, without any retraining,
and yield comparable gains. Extensive ablations validate these findings and all
artifacts are fully open-sourced.
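
The abstract states only that a trained SAE is used to "guide" standard supervised fine-tuning; it does not spell out the mechanism. The PyTorch snippet below is a minimal sketch of that two-stage recipe under stated assumptions: it fits an L1-sparse autoencoder to source-model hidden states, then adds an SAE-based reconstruction term to an otherwise standard SFT loss on verified question-answer data. All names (`SparseAutoencoder`, `train_sae`, `sae_guided_sft_loss`), dimensions, the sparsity penalty, and the guidance term are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two SAE-Tuning stages described in the abstract.
# Module names, hook points, and loss terms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """A plain ReLU + L1 sparse autoencoder over hidden states (one common SAE variant)."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        z = F.relu(self.encoder(h))      # sparse feature activations
        return self.decoder(z), z


def train_sae(source_hidden: torch.Tensor, d_dict: int = 4096,
              l1_coeff: float = 1e-3, steps: int = 100) -> SparseAutoencoder:
    """Stage 1: fit an SAE to hidden states collected from the source model."""
    sae = SparseAutoencoder(source_hidden.shape[-1], d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(steps):
        recon, z = sae(source_hidden)
        loss = F.mse_loss(recon, source_hidden) + l1_coeff * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def sae_guided_sft_loss(logits: torch.Tensor, labels: torch.Tensor,
                        target_hidden: torch.Tensor, sae: SparseAutoencoder,
                        guide_coeff: float = 0.1) -> torch.Tensor:
    """Stage 2 (assumed form): next-token SFT loss on verified QA pairs plus a term
    that keeps the target model's hidden states well reconstructed by the frozen SAE."""
    sft_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    recon, _ = sae(target_hidden)        # SAE parameters are frozen by the caller
    guide_loss = F.mse_loss(recon, target_hidden)
    return sft_loss + guide_coeff * guide_loss


# Toy usage with random stand-ins for real activations and logits.
d_model, vocab = 1536, 32000             # 1536 is the hidden size of a 1.5B Qwen model
h_source = torch.randn(256, d_model)     # stand-in for source-model activations
sae = train_sae(h_source, d_dict=2048, steps=50)
for p in sae.parameters():
    p.requires_grad_(False)              # keep the SAE frozen during SFT

h_target = torch.randn(4 * 16, d_model, requires_grad=True)
logits = torch.randn(4, 16, vocab, requires_grad=True)
labels = torch.randint(0, vocab, (4, 16))
loss = sae_guided_sft_loss(logits, labels, h_target, sae)
loss.backward()
print(float(loss))
```

In this reading, the frozen SAE acts as a portable carrier of reasoning features: the same object trained on one source model could, in principle, be attached to a different target model, which is consistent with (though not a reconstruction of) the modularity result reported for Qwen/Qwen-Math SAEs attached to R1-Distill.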