AcademiClaw: When Students Set Challenges for AI Agents
May 4, 2026
作者: Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu
cs.AI
Abstract
Benchmarks within the OpenClaw ecosystem have so far exclusively evaluated assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from Olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored for task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.