SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Abstract

Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework that monitors and controls drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification that flags improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including an 18.3% improvement on code tasks and 16.8% on reasoning tasks, while preserving constraints in two domains and maintaining a low violation rate in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing that early improvement cycles are efficient while later cycles incur rising alignment costs, and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
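To make the three safeguards concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes hand-picked weights and thresholds where the abstract describes a learned detector, omits the semantic signal (which would need an embedding model), and uses hypothetical names throughout (`goal_drift_index`, `accept_revision`, `syntactically_valid`).

```python
import ast
from collections import Counter
from typing import Callable, Sequence, Tuple


def lexical_drift(a: str, b: str) -> float:
    """1 minus the Jaccard overlap of whitespace-separated tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def structural_drift(a: str, b: str) -> float:
    """Relative difference in line count (a crude structural proxy)."""
    la, lb = len(a.splitlines()), len(b.splitlines())
    return abs(la - lb) / max(la, lb, 1)


def distributional_drift(a: str, b: str) -> float:
    """Total variation distance between character frequency distributions."""
    ca, cb = Counter(a), Counter(b)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        return 0.0 if na == nb else 1.0
    return 0.5 * sum(abs(ca[k] / na - cb[k] / nb) for k in set(ca) | set(cb))


def goal_drift_index(reference: str, candidate: str,
                     weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of drift signals in [0, 1]. Weights are assumed, not learned."""
    signals = (lexical_drift(reference, candidate),
               structural_drift(reference, candidate),
               distributional_drift(reference, candidate))
    return sum(w * s for w, s in zip(weights, signals))


def syntactically_valid(code: str) -> bool:
    """Example invariant for code tasks: the candidate must parse as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def accept_revision(reference: str, candidate: str,
                    candidate_score: float, best_score: float,
                    constraints: Sequence[Callable[[str], bool]],
                    gdi_threshold: float = 0.5) -> Tuple[bool, str]:
    """Apply the three checks in order: constraints, drift, regression."""
    if not all(check(candidate) for check in constraints):
        return False, "constraint violation"
    if goal_drift_index(reference, candidate) > gdi_threshold:
        return False, "drift"
    if candidate_score < best_score:
        return False, "regression"
    return True, "accepted"
```

In this sketch a revision is kept only if it passes every invariant, stays within the drift threshold relative to the reference output, and does not score below the best result seen so far. The quality scores are assumed to come from an external evaluator that is not shown here.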
