Utility-Learning Tension in Self-Modifying Agents

Abstract

As systems trend toward superintelligence, a natural modeling premise is that agents can self-improve along every facet of their own design. We formalize this with a five-axis decomposition and a decision layer, separating incentives from learning behavior and analyzing each axis in isolation. Our central result identifies a sharp utility-learning tension: a structural conflict in self-modifying systems whereby utility-driven changes that improve immediate or expected performance can also erode the statistical preconditions for reliable learning and generalization. We show that distribution-free guarantees are preserved if and only if the policy-reachable model family is uniformly capacity-bounded; when capacity can grow without limit, utility-rational self-changes can render learnable tasks unlearnable. Under standard assumptions common in practice, these axes reduce to the same capacity criterion, yielding a single boundary for safe self-modification. Numerical experiments across several axes validate the theory by comparing destructive utility policies against our proposed two-gate policies that preserve learnability.
