Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
December 30, 2025
Authors: Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang
cs.AI
Abstract
Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined, word-level concepts (e.g., overthinking, reflection) to analyze reasoning in a supervised manner. Such methods are limited because they cannot feasibly capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level "steps" and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
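The pipeline outlined in the abstract (sentence-level segmentation, a sparse auto-encoder over step-level activations, and steering along a decoder column) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`split_into_steps`, `sae_encode`, `decoder_column`, `steer`), the ReLU encoder form, the punctuation-based sentence splitter, the unit-norm scaling, and the strength parameter `alpha` are all assumptions made for illustration, and the toy lists stand in for real model activations and trained SAE weights.

```python
import math
import re


def split_into_steps(trace: str) -> list[str]:
    """Split a chain-of-thought trace into sentence-level steps.

    Assumption: a step ends at '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]


def sae_encode(x, W_enc, b_enc):
    """Assumed ReLU encoder: f = ReLU(W_enc @ x + b_enc).

    W_enc is a list of n_features rows, each of length d_model.
    """
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def decoder_column(W_dec, j):
    """Return column j of the decoder (d_model rows by n_features columns).

    Each column is treated as one candidate reasoning vector.
    """
    return [row[j] for row in W_dec]


def steer(h, direction, alpha):
    """Shift activation h along the unit-normalized direction.

    alpha > 0 amplifies the behavior, alpha < 0 suppresses it (assumed form).
    """
    norm = math.sqrt(sum(d * d for d in direction))
    return [hi + alpha * d / norm for hi, d in zip(h, direction)]


if __name__ == "__main__":
    trace = "Let x = 2. Wait, let me check that again. So the answer is 4."
    for step in split_into_steps(trace):
        print(step)

    # Toy decoder with d_model = 2 and n_features = 2.
    W_dec = [[1.0, 0.0],
             [0.0, 2.0]]
    direction = decoder_column(W_dec, 1)
    print(steer([1.0, 0.0], direction, alpha=3.0))
```

In a real setting, `h` would be a step-level activation taken from the model and `W_dec` would come from a trained SAE; how step-level activations are pooled and which layer is used are details the abstract does not specify.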