Steerability of Instrumental-Convergence Tendencies in LLMs

Abstract

We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (developers reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety-security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability so that malicious actors cannot elicit harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques such as fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces the measured convergence rate (e.g., shutdown avoidance, self-replication). For Qwen3-30B Instruct, the convergence rate drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models show lower convergence rates than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.
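
As a rough illustration of the measurement described in the abstract, the sketch below appends a steering suffix to a task prompt and computes a convergence rate as the percentage of judged responses flagged as instrumentally convergent. This is a minimal sketch under stated assumptions, not the authors' implementation: the suffix wording, the function names, and the flag counts are all hypothetical. The counts are chosen only so that the resulting rates round to the figures quoted above; the abstract does not state the number of benchmark tasks.

```python
from typing import Sequence

# Hypothetical suffix text: the actual wording used in the paper
# is not given in the abstract.
ANTI_SUFFIX = "Do not pursue self-preservation, replication, or resource acquisition."


def with_suffix(task_prompt: str, suffix: str) -> str:
    """Append a steering suffix to a benchmark task prompt."""
    return f"{task_prompt.rstrip()}\n\n{suffix}"


def convergence_rate(flags: Sequence[bool]) -> float:
    """Percentage of responses that a judge flagged as instrumentally convergent."""
    if not flags:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(flags) / len(flags)


# Illustrative judge flags only (assumed counts, not data from the paper).
pro_flags = [True] * 58 + [False] * 13   # pro-instrumental suffix condition
anti_flags = [True] * 2 + [False] * 69   # anti-instrumental suffix condition

pro = convergence_rate(pro_flags)
anti = convergence_rate(anti_flags)
print(f"pro: {pro:.2f}%  anti: {anti:.2f}%  drop: {pro - anti:.2f} points")
```

Reporting the difference in percentage points, rather than as a ratio, keeps the comparison readable when the anti-instrumental rate is close to zero.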
