ClawArena: Benchmarking AI Agents in Evolving Information Environments

Abstract

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.
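The abstract does not say how set-selection answers are scored, so the following is only an illustrative sketch and not ClawArena's actual implementation. It assumes a question whose answer is a set of option labels and shows two plausible scoring rules: a strict exact-set match and a partial-credit Jaccard overlap. The names `SetSelectionQuestion`, `exact_match`, and `jaccard` are hypothetical and do not come from the ClawArena repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetSelectionQuestion:
    """Hypothetical multi-choice question whose answer is a set of option labels."""
    prompt: str
    options: tuple[str, ...]
    correct: frozenset[str]


def exact_match(question: SetSelectionQuestion, selected: set[str]) -> bool:
    """Assumed strict rule: every correct option chosen and no incorrect one."""
    return frozenset(selected) == question.correct


def jaccard(question: SetSelectionQuestion, selected: set[str]) -> float:
    """Assumed partial-credit rule: size of the overlap divided by size of the union."""
    union = question.correct | selected
    if not union:
        # Both sets are empty, so the selection trivially matches.
        return 1.0
    return len(question.correct & selected) / len(union)


# Illustrative question; the prompt and options are made up for this sketch.
q = SetSelectionQuestion(
    prompt="Which statements are supported by the current evidence?",
    options=("A", "B", "C", "D"),
    correct=frozenset({"A", "C"}),
)
```

Under the strict rule, selecting only `{"A"}` scores as a miss, while the Jaccard rule would award half credit. Which of these (if either) the benchmark uses would need to be checked against the released code.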
