AcademiClaw: When Students Set Challenges for AI Agents
May 4, 2026
作者: Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu
cs.AI
Abstract
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.