大语言模型中工具性趋同倾向的可控性
Steerability of Instrumental-Convergence Tendencies in LLMs
January 4, 2026
作者: Jakub Hoscilowicz
cs.AI
摘要
我们研究了AI系统的两大属性:能力(系统能执行的任务范畴)与可操控性(行为向预期目标可靠转移的程度)。核心问题在于能力提升是否会削弱可操控性并引发控制崩溃风险。我们进一步区分了授权可操控性(开发者可靠实现预期行为)与非授权可操控性(攻击者诱导出违规行为)。这种区分揭示了AI模型面临的基本安全-安防矛盾:安全需要高可操控性以实施控制(如停止/拒绝机制),而安防则需要降低恶意行为者诱导有害行为的可操控性。这种张力对开源权重模型构成重大挑战,当前这类模型通过微调或对抗攻击等常见技术表现出高可操控性。基于Qwen3与InstrumentalEval的测试发现,简短的反工具性提示后缀能显著降低测量收敛率(如规避关机、自我复制等场景)。以Qwen3-30B Instruct模型为例,其收敛率从支持工具性后缀下的81.69%骤降至反工具性后缀下的2.82%。在反工具性提示下,大型对齐模型比小型模型表现出更低收敛率(Instruct版:2.82% vs 4.23%;Thinking版:4.23% vs 9.86%)。代码详见github.com/j-hoscilowicz/instrumental_steering。
English
We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety--security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability for malicious actors to elicit harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques like fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces the measured convergence rate (e.g., shutdown avoidance, self-replication). For Qwen3-30B Instruct, the convergence rate drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models show lower convergence rates than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.