

Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

May 23, 2025
作者: Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang
cs.AI

Abstract

Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to large reasoning models, confirming its effectiveness in precise reasoning control.
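The core idea behind SAE-based steering can be sketched in a few lines: a sparse autoencoder re-expresses a model's hidden state as activations over many "atoms" (decoder directions), and steering shifts the hidden state along one chosen atom's direction. The sketch below is an illustrative toy with random, untrained weights and made-up dimensions; it is not the authors' STA implementation, only the general mechanism the abstract refers to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real models use thousands of dims).
d_model, d_sae = 8, 32

# A toy sparse autoencoder: random untrained encoder/decoder weights.
W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.standard_normal((d_sae, d_model)) / np.sqrt(d_sae)

def sae_encode(h):
    # ReLU gives a sparse, overcomplete code: one activation per "atom".
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(z):
    # Reconstruct the hidden state from atom activations.
    return z @ W_dec

def steer(h, atom_idx, strength):
    # Shift the hidden state along a single decoder direction
    # (the "target atom"), leaving other directions untouched.
    return h + strength * W_dec[atom_idx]

h = rng.standard_normal(d_model)          # a stand-in hidden state
h_steered = steer(h, atom_idx=3, strength=2.0)

# The intervention is exactly a move along atom 3's direction.
assert np.allclose(h_steered - h, 2.0 * W_dec[3])
```

In practice the SAE is trained on residual-stream activations and the hard part, as the abstract notes, is locating which atom corresponds to the behavior one wants to control; the steering step itself is this simple additive shift.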

