
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

May 23, 2025
作者: Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang
cs.AI

Abstract

Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.
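The core idea the abstract describes — encoding a hidden state into a sparse feature space, intervening on a single disentangled "target atom," and decoding the edit back into the model's residual stream — can be sketched as follows. This is a minimal illustration with random stand-in weights, not the paper's implementation: the dimensions, the weight matrices `W_enc`/`W_dec`, and the `steer` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): model hidden width and SAE dictionary size.
d_model, d_sae = 8, 32

# Hypothetical pretrained SAE weights: encoder and decoder (randomly initialized here).
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def steer(hidden, atom_idx, strength):
    """Amplify one SAE feature (a 'target atom') and map the edit back to model space."""
    acts = np.maximum(hidden @ W_enc, 0.0)  # sparse feature activations (ReLU encoder)
    delta = np.zeros_like(acts)
    delta[atom_idx] = strength              # intervene on the target atom only
    return hidden + delta @ W_dec           # decoded edit added to the hidden state

h = rng.normal(size=d_model)
h_steered = steer(h, atom_idx=3, strength=5.0)
```

Because only one feature is edited, the change to the hidden state lies entirely along that atom's decoder direction (`h_steered - h == 5.0 * W_dec[3]`), which is what makes this kind of intervention more targeted than dense steering vectors.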
