大型语言模型中工具趋同倾向的可控性

摘要

我们探讨人工智能系统的两大属性：能力（系统能执行的功能）与可操控性（行为向预期目标可靠偏移的程度）。核心问题在于能力提升是否会削弱可操控性并引发控制崩溃风险。我们进一步区分授权可操控性（开发者可靠实现预期行为）与非授权可操控性（攻击者诱导出禁止行为）。这一区分揭示了AI模型固有的安全-安防困境：安全需要高可操控性以实施控制（如停止/拒绝机制），而安防则需要低可操控性来防止恶意行为者诱导有害行为。这种矛盾对开源权重模型构成重大挑战，当前这类模型通过微调或对抗攻击等常见技术展现出高可操控性。基于Qwen3与InstrumentalEval的实验发现，简短的反工具性提示后缀能显著降低测量收敛率（如规避关机、自我复制等场景）。以Qwen3-30B Instruct模型为例，其收敛率从支持工具性后缀下的81.69%骤降至反工具性后缀下的2.82%。在反工具性提示下，大型对齐模型比小型模型展现出更低的收敛率（Instruct版：2.82% vs 4.23%；思考版：4.23% vs 9.86%）。代码详见github.com/j-hoscilowicz/instrumental_steering。

English

We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety--security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability for malicious actors to elicit harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques like fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces the measured convergence rate (e.g., shutdown avoidance, self-replication). For Qwen3-30B Instruct, the convergence rate drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models show lower convergence rates than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.

大型语言模型中工具趋同倾向的可控性

Steerability of Instrumental-Convergence Tendencies in LLMs

摘要

Support