SkillEvolBench: A Benchmark for Evaluating the Evolution from Episodic Experience to Procedural Skills

Abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust, reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or building larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.
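
The acquire, update, freeze, and deploy sequence described in the abstract can be illustrated with a minimal sketch. Everything named here (`Task`, `SkillLibrary`, `run_protocol`, and the `solve` and `distill` callbacks) is a hypothetical stand-in chosen for illustration and is not the benchmark's actual interface. The sketch assumes that the conditions being compared differ only in what `distill` stores: nothing for a no-skill control, the trajectory itself for a raw-trajectory control, and an abstracted entry for skill evolution.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    phase: str  # "acquisition" or "deployment" (hypothetical labels)


@dataclass
class SkillLibrary:
    entries: List[str] = field(default_factory=list)
    frozen: bool = False

    def update(self, entry: Optional[str]) -> None:
        """Store an entry unless the library is frozen or the entry is empty."""
        if self.frozen:
            raise RuntimeError("library is frozen; no updates during deployment")
        if entry:
            self.entries.append(entry)


# Hypothetical callbacks:
#   solve(task, entries) -> (success, trajectory)
#   distill(trajectory, success) -> entry to store, or None to store nothing
Solve = Callable[[Task, Sequence[str]], Tuple[bool, str]]
Distill = Callable[[str, bool], Optional[str]]


def run_protocol(
    tasks: Sequence[Task], solve: Solve, distill: Distill
) -> Tuple[SkillLibrary, Dict[str, List[bool]]]:
    library = SkillLibrary()
    outcomes: Dict[str, List[bool]] = {"acquisition": [], "deployment": []}

    # Acquisition: solve each task, then update the library from the
    # trajectory and the success flag (a stand-in for verifier feedback).
    for task in (t for t in tasks if t.phase == "acquisition"):
        success, trajectory = solve(task, tuple(library.entries))
        outcomes["acquisition"].append(success)
        library.update(distill(trajectory, success))

    # Deployment: the library is frozen, so it is read but never written.
    library.frozen = True
    for task in (t for t in tasks if t.phase == "deployment"):
        success, _ = solve(task, tuple(library.entries))
        outcomes["deployment"].append(success)

    return library, outcomes
```

Under these assumptions, passing `lambda trajectory, success: trajectory` as `distill` mimics a raw-trajectory control, while `lambda trajectory, success: None` mimics a no-skill control. Freezing the library before deployment is what prevents deployment-time feedback from leaking into the stored skills.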