Orchard: An Open-Source Agentic Modeling Framework

Abstract

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service that provides reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from the productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves success rates of 74.1%, 67.0%, and 64.0% on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained on only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval, rising to 73.9% when paired with the stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
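To make the pass@3 figure reported for Orchard-Claw concrete, the following is a minimal sketch of how a pass@k score is commonly computed. The abstract does not state how Claw-Eval aggregates attempts, so this assumes the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) for a task with n attempts of which c succeed, averaged over tasks. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and not part of Orchard.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds.

    n: attempts sampled for the task, c: attempts that succeeded.
    Assumption: the standard unbiased estimator, not Claw-Eval's own code.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (n attempts, c successes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With exactly three attempts per task, `pass_at_k(3, c, 3)` reduces to 1.0 if any attempt succeeded and 0.0 otherwise, so the benchmark score is simply the fraction of tasks solved at least once in three tries.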