AcademiClaw: When Students Set Challenges for AI Agents

Abstract

Benchmarks within the OpenClaw ecosystem have so far evaluated only assistant-level tasks, leaving OpenClaw's academic-level capabilities largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows (homework, research projects, competitions, and personal projects) that the students found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging; 16 of the tasks require CUDA GPU execution. Each task runs in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics that combine six complementary techniques, while an independent five-category safety audit provides additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-source data and code will serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
