AcademiClaw: When Students Set Challenges for AI Agents

Abstract

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows (homework, research projects, competitions, and personal projects) that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
