SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
March 6, 2026
Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
cs.AI
Abstract
Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework that monitors and controls drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification that flags improvement cycles which undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including an 18.3% improvement on code tasks and 16.8% on reasoning, while preserving constraints in both of those domains and keeping truthfulness violations low. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later, and expose domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
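To make the GDI idea concrete, the following is a minimal sketch of how a multi-signal drift score might be assembled: a weighted combination of lexical, structural, and distributional drift proxies between a reference output and a revised output. The function names, the specific proxy signals, and the weights are illustrative assumptions; the paper's actual detector is learned and also uses semantic (embedding-based) signals, which are not reproduced here.

```python
from collections import Counter

def lexical_drift(a: str, b: str) -> float:
    # 1 - Jaccard overlap of token sets (hypothetical lexical signal).
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - (len(ta & tb) / len(ta | tb) if ta | tb else 1.0)

def structural_drift(a: str, b: str) -> float:
    # Crude structural proxy: relative change in token count.
    na, nb = len(a.split()), len(b.split())
    return abs(na - nb) / max(na, 1)

def distributional_drift(a: str, b: str) -> float:
    # Total variation distance between unigram frequency distributions.
    ca, cb = Counter(a.split()), Counter(b.split())
    na, nb = sum(ca.values()) or 1, sum(cb.values()) or 1
    return 0.5 * sum(abs(ca[w] / na - cb[w] / nb) for w in set(ca) | set(cb))

def goal_drift_index(reference: str, revised: str,
                     weights=(0.4, 0.3, 0.3)) -> float:
    # Weighted sum of drift signals in [0, 1]; weights are assumed,
    # not the learned combination described in the abstract.
    signals = (
        lexical_drift(reference, revised),
        structural_drift(reference, revised),
        distributional_drift(reference, revised),
    )
    return sum(w * s for w, s in zip(weights, signals))

# Usage: an unrelated rewrite should score higher drift than a small edit.
ref = "def add(a, b): return a + b"
print(goal_drift_index(ref, ref + "  # validated"))
print(goal_drift_index(ref, "completely different text here now"))
```

In a safeguarded improvement loop, a score like this would be compared against a calibrated threshold (the abstract mentions calibration on an 18-task validation set) to decide whether a revision cycle is accepted or flagged for drift.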