Utility-Learning Tension in Self-Modifying Agents
October 5, 2025
Authors: Charles L. Wang, Keir Dorchen, Peter Jin
cs.AI
Abstract
As systems trend toward superintelligence, a natural modeling premise is that
agents can self-improve along every facet of their own design. We formalize
this with a five-axis decomposition and a decision layer, separating incentives
from learning behavior and analyzing each axis in isolation. Our central result
identifies a sharp utility-learning tension: the structural conflict in
self-modifying systems whereby utility-driven changes that improve immediate or
expected performance can erode the statistical preconditions for reliable
learning and generalization. We show that
distribution-free guarantees are preserved if and only if the policy-reachable
model family is uniformly capacity-bounded; when capacity can grow without
limit, utility-rational self-modifications can render learnable tasks
unlearnable. Under assumptions that are standard in practice, all five axes
reduce to the same capacity criterion, yielding a single boundary for safe
self-modification. Numerical
experiments across several axes validate the theory by comparing destructive
utility policies against our proposed two-gate policies that preserve
learnability.
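To make the capacity criterion concrete, here is a minimal sketch in standard
PAC-learning notation, assuming binary classification and taking VC dimension
as the capacity measure; the paper's own capacity notion and notation may
differ. Writing $\mathcal{H}_t$ for the model class reachable after $t$
self-modifications under a policy $\pi$, the abstract's dichotomy would read:

% Sketch only: VC(.) as the capacity measure and the classes H_t are
% assumptions for illustration, not notation taken from the paper.
\[
\mathcal{H}_\pi \;=\; \bigcup_{t \ge 0} \mathcal{H}_t,
\qquad
\mathcal{H}_\pi \text{ admits distribution-free PAC guarantees}
\;\Longleftrightarrow\;
\sup_{t \ge 0} \mathrm{VC}(\mathcal{H}_t) < \infty.
\]

The forward direction is the fundamental theorem of statistical learning:
finite VC dimension yields uniform convergence under every data distribution.
The reverse direction is the destructive case the abstract describes: a
utility-rational policy that enlarges $\mathcal{H}_t$ without bound leaves no
finite distribution-free sample-complexity guarantee, so a previously learnable
task becomes unlearnable in the distribution-free sense.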