SkillOS: Learning Skill Curation for Self-Evolving Agents

Abstract

LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, and high-quality skill curation is the key bottleneck. Existing approaches rely on manual skill curation, prescribe heuristic skill operations, or train only for short-horizon skill operations. As a result, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven reinforcement learning (RL) training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor, which retrieves and applies skills, with a trainable skill curator, which updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on task streams grouped by skill-relevant task dependencies: earlier trajectories in a group update the SkillRepo, and later related tasks evaluate those updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free baselines and strong memory-based baselines in both effectiveness and efficiency, and the learned skill curator generalizes across different executor backbones and task domains. Further analyses show that the learned curator leads to more targeted skill use, while the skills in SkillRepo evolve over time into more richly structured Markdown files that encode higher-level meta-skills.
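The grouped-stream setup described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the `SkillRepo` class layout, the word-overlap retrieval, the `run_group` helper, and the toy executor and curator are all hypothetical. The returned value covers only a downstream-success signal; the composite rewards mentioned in the abstract are not specified there and are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillRepo:
    """Hypothetical stand-in for the external SkillRepo: skill name -> Markdown text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def retrieve(self, task: str, k: int = 2) -> List[str]:
        # Toy retrieval by word overlap (an assumption, not the paper's retriever).
        words = set(task.lower().split())
        scored = [(len(words & set(text.lower().split())), name)
                  for name, text in self.skills.items()]
        ranked = sorted(scored, reverse=True)
        return [self.skills[name] for score, name in ranked[:k] if score > 0]


# Frozen executor: (task, retrieved skills) -> (trajectory, success flag).
Executor = Callable[[str, List[str]], Tuple[str, bool]]
# Trainable curator: edits the repo in place given one finished trajectory.
Curator = Callable[[SkillRepo, str, str, bool], None]


def run_group(tasks: List[str], executor: Executor, curator: Curator,
              repo: SkillRepo, n_update: int) -> float:
    """Earlier tasks update the repo; later related tasks evaluate the updates."""
    for task in tasks[:n_update]:
        trajectory, success = executor(task, repo.retrieve(task))
        curator(repo, task, trajectory, success)
    later = tasks[n_update:]
    outcomes = [executor(task, repo.retrieve(task))[1] for task in later]
    # Delayed, indirect signal for the curator: success rate on the later tasks.
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def toy_executor(task: str, skills: List[str]) -> Tuple[str, bool]:
    # Hypothetical executor that succeeds only when a relevant skill was retrieved.
    success = any("sort" in skill for skill in skills)
    return f"attempted: {task}", success


def toy_curator(repo: SkillRepo, task: str, trajectory: str, success: bool) -> None:
    # Hypothetical curation policy: record a sorting skill after a sorting task.
    if "sort" in task:
        repo.skills["sorting"] = "# sort lists\nuse sorted when asked to sort a list"


def noop_curator(repo: SkillRepo, task: str, trajectory: str, success: bool) -> None:
    # Baseline that never edits the repo.
    return None


GROUP = ["sort a list of numbers", "sort a list of names", "sort a list of dates"]
curated_reward = run_group(GROUP, toy_executor, toy_curator, SkillRepo(), n_update=1)
baseline_reward = run_group(GROUP, toy_executor, noop_curator, SkillRepo(), n_update=1)
```

In this toy setting the curator's edit is scored only through how the frozen executor performs on the later tasks of the same group, which mirrors the indirect and delayed feedback the abstract describes.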
