AcademiClaw: When Students Set Challenges for AI Agents
May 4, 2026
作者: Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu
cs.AI
Abstract
Benchmarks within the OpenClaw ecosystem have so far exclusively evaluated assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from Olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored for task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.