SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Abstract

As the capability frontier of autonomous agents expands, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families. Task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, so the tasks in a family share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward to later tasks. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
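The evaluation protocol described in the abstract (start without skills, solve a family's tasks in order, patch the library after each attempt, carry it forward) can be illustrated with a minimal sketch. This is not the SkillFlow harness: `Attempt`, `SkillLibrary`, `run_family`, `success_rate`, and the toy agent below are hypothetical names introduced only for illustration, and representing a skill as a name-to-text mapping is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Attempt:
    """Hypothetical record of one task attempt (all field names are assumptions)."""
    success: bool
    trajectory: List[str]   # steps the agent took
    rubric_notes: str       # rubric-based feedback on the attempt


@dataclass
class SkillLibrary:
    """Assumed representation: skill name -> skill text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch adds new skills or overwrites (repairs) existing ones.
        self.skills.update(patch)


def run_family(
    tasks: List[str],
    solve: Callable[[str, SkillLibrary], Attempt],
    propose_patch: Callable[[str, Attempt, SkillLibrary], Dict[str, str]],
) -> Tuple[List[bool], SkillLibrary]:
    """Solve one family's tasks sequentially, patching the library after each."""
    library = SkillLibrary()          # the agent begins without skills
    outcomes: List[bool] = []
    for task in tasks:
        attempt = solve(task, library)
        outcomes.append(attempt.success)
        # Externalize lessons from the trajectory and rubric, then carry forward.
        library.apply_patch(propose_patch(task, attempt, library))
    return outcomes, library


def success_rate(outcomes: List[bool]) -> float:
    """Task success in percent."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy stand-ins for a real agent: it succeeds only once a "workflow" skill exists.
def toy_solve(task: str, library: SkillLibrary) -> Attempt:
    ok = "workflow" in library.skills
    return Attempt(ok, [f"attempted {task}"], "ok" if ok else "missing workflow")


def toy_patch(task: str, attempt: Attempt, library: SkillLibrary) -> Dict[str, str]:
    # Write a skill only after a failure; otherwise leave the library unchanged.
    return {} if attempt.success else {"workflow": f"lesson from {task}"}
```

With the toy agent, the first task in a family fails, the resulting patch adds a skill, and later tasks benefit from it. Scored the same way, the reported gain for Claude Opus 4.6 is simply the difference between the two success rates: 71.08 - 62.65 = 8.43 points.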
