From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Abstract

Language agents increasingly improve by reusing skills: structured procedural artifacts distilled from past experience. Domain-level and model-generated skills are especially promising. They enable fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding of these skills remains limited. No comprehensive study spans the full skill lifecycle (experience generation, skill extraction, and skill consumption) to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, and skill utility is independent of both model scale and baseline task strength. To explain these patterns, we dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility. This meta-skill consistently improves skill quality across domains and substantially reduces negative transfer.