Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Abstract

As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement-learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
