SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
March 6, 2026
Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
cs.AI
Abstract
Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including an 18.3% improvement on code tasks and 16.8% on reasoning, while fully preserving constraints in the first two domains and keeping truthfulness violations low. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later, and expose domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
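The abstract describes the GDI as a learned combination of semantic, lexical, structural, and distributional signals, but gives no formula. A minimal, hypothetical sketch of such a multi-signal drift score follows; the component measures, weights, and function names are all illustrative assumptions (the semantic, embedding-based signal is omitted to stay dependency-free):

```python
# Hypothetical sketch of a multi-signal Goal Drift Index (GDI).
# The paper's actual signals, weights, and thresholds are not given in the
# abstract; the three component measures below are illustrative stand-ins.
from collections import Counter

def lexical_overlap(a: str, b: str) -> float:
    """Lexical signal: Jaccard similarity over token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def length_ratio(a: str, b: str) -> float:
    """Structural signal (crude): ratio of shorter to longer token count."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb) if max(la, lb) else 1.0

def distribution_similarity(a: str, b: str) -> float:
    """Distributional signal: 1 - total-variation distance between
    the two texts' token frequency distributions."""
    ca, cb = Counter(a.split()), Counter(b.split())
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    tv = 0.5 * sum(abs(ca[t] / na - cb[t] / nb) for t in vocab)
    return 1.0 - tv

def goal_drift_index(original: str, revised: str,
                     weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted drift score in [0, 1]; higher means more drift.
    In the paper the combination is learned; here it is a fixed
    weighted average for illustration."""
    sims = (lexical_overlap(original, revised),
            length_ratio(original, revised),
            distribution_similarity(original, revised))
    similarity = sum(w * s for w, s in zip(weights, sims))
    return 1.0 - similarity
```

A monitor in this style would compute `goal_drift_index` between each self-revision cycle's output and a reference output, and halt or roll back when the score crosses a calibrated threshold, which matches the abstract's description of thresholds tuned on a small validation set.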