SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Abstract

Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including an 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
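To make the three safeguards concrete, the following is a minimal sketch of how they might jointly gate a single improvement cycle. It is an illustration under stated assumptions, not the method as implemented in the paper: the weighted-mean form of the drift index, the weights, the thresholds, and the names `goal_drift_index`, `gate_cycle`, `is_valid_python`, and `CycleDecision` are all hypothetical. The abstract states only that GDI is learned and combines semantic, lexical, structural, and distributional measures.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CycleDecision:
    accept: bool          # True only if all three safeguards pass
    gdi: float            # combined drift score for this cycle
    violated: List[str]   # names of constraints the candidate failed
    regressed: bool       # True if quality fell below the previous cycle


def goal_drift_index(signals: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hypothetical GDI: weighted mean of per-signal drift scores in [0, 1].

    The paper describes a learned detector; fixed weights are an assumption.
    """
    total = sum(weights[name] for name in signals)
    return sum(weights[name] * value for name, value in signals.items()) / total


def is_valid_python(source: str) -> bool:
    """Example invariant: the candidate must parse as Python source."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def gate_cycle(
    candidate: str,
    signals: Dict[str, float],
    weights: Dict[str, float],
    constraints: Dict[str, Callable[[str], bool]],
    prev_quality: float,
    new_quality: float,
    gdi_threshold: float = 0.5,          # assumed value, not from the paper
    regression_tolerance: float = 0.0,   # assumed value, not from the paper
) -> CycleDecision:
    """Accept a cycle only if drift is low, invariants hold, and quality did not regress."""
    gdi = goal_drift_index(signals, weights)
    violated = [name for name, check in constraints.items() if not check(candidate)]
    regressed = new_quality < prev_quality - regression_tolerance
    accept = gdi <= gdi_threshold and not violated and not regressed
    return CycleDecision(accept, gdi, violated, regressed)


if __name__ == "__main__":
    # Illustrative inputs only; these numbers are not results from the paper.
    decision = gate_cycle(
        candidate="def f(x):\n    return x + 1\n",
        signals={"semantic": 0.2, "lexical": 0.4, "structural": 0.1, "distributional": 0.3},
        weights={"semantic": 1.0, "lexical": 1.0, "structural": 1.0, "distributional": 1.0},
        constraints={"syntax": is_valid_python},
        prev_quality=0.70,
        new_quality=0.78,
    )
    print(decision)
```

In this sketch a cycle is rejected as soon as any one safeguard fails, which mirrors the abstract's framing of the three checks as independent safeguards. A real deployment would presumably calibrate the thresholds on a validation set, as the abstract describes.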
