Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
December 30, 2025
Authors: Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang
cs.AI
Abstract
Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often analyze reasoning in a supervised manner, relying on human-defined concepts (e.g., overthinking, reflection) identified at the word level. Such methods are limited because they cannot feasibly capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework, RISE (Reasoning behavior Interpretability via Sparse auto-Encoder), for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision: by identifying confidence-related vectors in the SAE decoder space, we demonstrate the ability to control response confidence. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
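The pipeline described in the abstract (split a trace into steps, encode step activations with an SAE, then intervene along a decoder column) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the regex-based sentence splitter, the one-hidden-layer ReLU SAE form, the additive steering rule with strength `alpha`, and all function names are hypothetical. How a single activation vector is obtained for each step, and how the SAE is trained, are not specified in the abstract and are left out.

```python
import math
import re


def split_into_steps(trace):
    """Split a chain-of-thought trace into sentence-level steps.

    Assumption: a naive rule that splits on whitespace following '.', '!' or '?'.
    """
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Forward pass of an assumed ReLU SAE.

    W_enc has shape (n_latents, d_model); W_dec has shape (d_model, n_latents).
    Returns the latent code z and the reconstruction x_hat.
    """
    z = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_enc, b_enc)]
    x_hat = [sum(w * zi for w, zi in zip(row, z)) + b
             for row, b in zip(W_dec, b_dec)]
    return z, x_hat


def decoder_column(W_dec, j):
    """Column j of the decoder: the activation-space direction of latent j."""
    return [row[j] for row in W_dec]


def steer(activation, direction, alpha):
    """Assumed additive intervention: shift the activation by alpha along
    the unit-normalized direction. Positive alpha amplifies, negative suppresses.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [a + alpha * d / norm for a, d in zip(activation, direction)]


# Toy usage with made-up weights (d_model = 2, n_latents = 3).
steps = split_into_steps("Let me check. Wait, that is wrong! So x = 2.")
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b_dec = [0.0, 0.0]
z, x_hat = sae_forward([1.0, -2.0], W_enc, b_enc, W_dec, b_dec)
steered = steer([0.0, 0.0], decoder_column(W_dec, 2), alpha=2.0)
```

In this sketch a "reasoning vector" is simply one decoder column. In a real model the shift would be applied to the hidden state at a chosen layer during generation, which is beyond what this toy example shows.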