AcademiClaw: When Students Set Challenges for AI Agents
May 4, 2026
作者: Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu
cs.AI
Abstract
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.