Utility-Learning Tension in Self-Modifying Agents
October 5, 2025
Authors: Charles L. Wang, Keir Dorchen, Peter Jin
cs.AI
Abstract
As systems trend toward superintelligence, a natural modeling premise is that
agents can self-improve along every facet of their own design. We formalize
this with a five-axis decomposition and a decision layer, separating incentives
from learning behavior and analyzing each axis in isolation. Our central result
identifies a sharp utility-learning tension: the structural conflict in
self-modifying systems whereby utility-driven changes that improve immediate or
expected performance can erode the statistical preconditions for reliable
learning and generalization. We show that
distribution-free guarantees are preserved if and only if the policy-reachable
model family is uniformly capacity-bounded; when capacity can grow without
limit, utility-rational self-modifications can render learnable tasks
unlearnable. Under assumptions that are standard in practice, all five axes
reduce to the same capacity criterion, yielding a single boundary for safe
self-modification. Numerical
experiments across several axes validate the theory by comparing destructive
utility policies against our proposed two-gate policies that preserve
learnability.
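To make the capacity criterion concrete, here is a minimal sketch in standard
PAC-learning notation, assuming binary classification and taking VC dimension
as the capacity measure; the paper's own capacity notion and notation may
differ. Writing $\mathcal{H}_t$ for the model class reachable after $t$
self-modifications under a policy $\pi$, the abstract's dichotomy would read:

% Sketch only: VC(.) as the capacity measure and the classes H_t are
% assumptions for illustration, not notation taken from the paper.
\[
\mathcal{H}_\pi \;=\; \bigcup_{t \ge 0} \mathcal{H}_t,
\qquad
\mathcal{H}_\pi \text{ admits distribution-free PAC guarantees}
\;\Longleftrightarrow\;
\sup_{t \ge 0} \mathrm{VC}(\mathcal{H}_t) < \infty.
\]

The forward direction is the fundamental theorem of statistical learning:
finite VC dimension yields uniform convergence under every data distribution.
The reverse direction is the destructive case the abstract describes: a
utility-rational policy that enlarges $\mathcal{H}_t$ without bound leaves no
finite distribution-free sample-complexity guarantee, so a previously learnable
task becomes unlearnable in the distribution-free sense.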