SEAL: Synergistic Co-Evolution of Agents and Learning Environments

Abstract

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.
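
The shared diagnosis signal described above can be pictured with a minimal Python sketch. It is not SEAL's implementation: the abstract does not state the reweighting rule or the failure taxonomy, so the label names, the weight values, the hint strings, and the functions `reweight_advantages` and `environment_hints` are all hypothetical. The sketch only shows the general idea of one set of turn-level failure labels feeding both a policy-side advantage reweighting and an environment-side feedback update.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    advantage: float              # per-turn advantage from any base estimator
    failure_label: Optional[str]  # diagnosis for this turn, None if not flagged


# Hypothetical failure labels and weights; the abstract specifies neither.
FAILURE_WEIGHTS: Dict[str, float] = {
    "wrong_tool": 2.0,
    "bad_arguments": 1.5,
    "no_recovery": 1.5,
}

# Hypothetical environment-side hints keyed by the same labels.
FAILURE_HINTS: Dict[str, str] = {
    "wrong_tool": "List the available tools and what each one is for.",
    "bad_arguments": "State the argument constraints in the tool description.",
    "no_recovery": "Return an error message that suggests a next step.",
}


def reweight_advantages(turns: List[Turn], default: float = 1.0) -> List[float]:
    """Scale each turn's advantage by a weight chosen from its failure label.

    Unflagged turns and labels missing from the table keep the default weight.
    """
    weighted = []
    for turn in turns:
        weight = default
        if turn.failure_label is not None:
            weight = FAILURE_WEIGHTS.get(turn.failure_label, default)
        weighted.append(turn.advantage * weight)
    return weighted


def environment_hints(turns: List[Turn]) -> List[str]:
    """Collect the feedback hints triggered by the observed failure labels."""
    labels = {t.failure_label for t in turns if t.failure_label in FAILURE_HINTS}
    return [FAILURE_HINTS[label] for label in sorted(labels)]


if __name__ == "__main__":
    rollout = [Turn(-1.0, "wrong_tool"), Turn(0.5, None), Turn(-0.4, "other")]
    print(reweight_advantages(rollout))
    print(environment_hints(rollout))
```

In this sketch the same diagnosis drives both sides of the loop: diagnosed turns contribute more strongly to the policy update, and the labels that actually occurred decide which extra cues the training environment would expose in the next round.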