ClawArena: Benchmarking AI Agents in Evolving Information Environments
April 5, 2026
Authors: Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
cs.AI
Abstract
AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.
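As a rough illustration of what a shell-based executable check could look like (this sketch is not taken from the ClawArena release; the file name, field, and dates below are entirely hypothetical), the grader can assert that the agent's workspace reflects the latest staged update rather than a superseded earlier fact:

```shell
#!/bin/sh
# Hypothetical scenario: an early session stated the project deadline was
# 2026-04-01, but a later update moved it to 2026-04-15. Here we simulate
# an agent that wrote the revised fact into its workspace.
mkdir -p workspace
printf 'deadline: 2026-04-15\n' > workspace/project.yaml

# Executable check: the workspace must contain the post-update value and
# must NOT still carry the invalidated one.
if grep -q 'deadline: 2026-04-15' workspace/project.yaml \
   && ! grep -q 'deadline: 2026-04-01' workspace/project.yaml; then
  echo PASS
else
  echo FAIL
fi
```

Checks of this shape make "workspace grounding" directly testable: the agent is scored not on what it says, but on whether the state it leaves behind matches the hidden ground truth after all updates.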