ClawArena: Benchmarking AI Agents in Evolving Information Environments
April 5, 2026
Authors: Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
cs.AI
Abstract
AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can cope with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multiple-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (a 15.4% performance range) and framework design (a 9.2% range) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.
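To make the set-selection format concrete, here is a minimal, hypothetical sketch of how such questions are typically scored: the agent selects a set of options, and credit is assigned either by exact match against the gold set or as partial credit via set-level F1. The function names and the choice of F1 are illustrative assumptions, not ClawArena's actual metric or API.

```python
# Hypothetical scoring for set-selection multiple-choice questions.
# (Illustrative sketch only; not ClawArena's actual metric or API.)

def exact_match(predicted: set[str], gold: set[str]) -> float:
    """Full credit only when the selected option set equals the gold set."""
    return 1.0 if predicted == gold else 0.0

def set_f1(predicted: set[str], gold: set[str]) -> float:
    """Partial credit: F1 between the predicted and gold option sets."""
    if not predicted or not gold:
        # Two empty sets agree perfectly; otherwise no overlap is possible.
        return 1.0 if predicted == gold else 0.0
    tp = len(predicted & gold)          # correctly selected options
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match({"A", "C"}, {"A", "C"}))  # 1.0
print(set_f1({"A", "B"}, {"A", "C"}))       # 0.5
```

Exact match penalizes any deviation, while set F1 distinguishes an agent that selects one correct and one wrong option from one that selects nothing correct; which behavior a benchmark rewards is a design choice.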