Orchard: An Open-Source Agentic Modeling Framework

Abstract

Agentic modeling aims to turn LLMs into autonomous agents that solve complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by gaps in infrastructure and training. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service that provides reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from the productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves success rates of 74.1%, 67.0%, and 64.0% on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval, and 73.9% when paired with the stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
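
The pass@3 figure reported for Claw-Eval is a pass@k metric. The abstract does not say how it is computed, so the following is only a sketch under the assumption that the widely used unbiased pass@k estimator applies: given n attempts per task of which c succeed, it estimates the probability that at least one of k attempts drawn without replacement succeeds. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and are not taken from Orchard.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task (assumed metric, not confirmed by the source).

    n: total attempts made on the task
    c: number of those attempts that succeeded
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With exactly three attempts per task, `pass_at_k(3, c, 3)` reduces to 1.0 whenever at least one attempt succeeded and 0.0 otherwise, so the benchmark score is simply the fraction of tasks solved at least once in three tries.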