Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Abstract

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.
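The abstract does not say how checkpoints from the two golden trajectories are turned into a score, so the following is only an illustrative sketch of one way a checkpoint-anchored, trajectory-level rubric could work. Every name here (`Checkpoint`, `pole_coverage`, `value_lean`, the action strings, and the two example poles) is hypothetical, and exact string matching stands in for whatever judge the benchmark actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One rubric item for a value pole (hypothetical structure)."""
    description: str
    action: str  # assumed: a checkpoint is satisfied when this action appears


def pole_coverage(trajectory: list[str], checkpoints: list[Checkpoint]) -> float:
    """Fraction of a pole's checkpoints whose action occurs in the trajectory."""
    if not checkpoints:
        return 0.0
    taken = set(trajectory)
    hits = sum(1 for cp in checkpoints if cp.action in taken)
    return hits / len(checkpoints)


def value_lean(trajectory: list[str],
               pole_a: list[Checkpoint],
               pole_b: list[Checkpoint]) -> float:
    """Signed score in [-1, 1]: positive leans toward pole A, negative toward pole B."""
    return pole_coverage(trajectory, pole_a) - pole_coverage(trajectory, pole_b)


# Hypothetical value-conflict task: privacy (pole A) versus efficiency (pole B).
privacy_pole = [
    Checkpoint("Redacts personal data before export", "redact_pii"),
    Checkpoint("Asks the user for consent", "ask_consent"),
]
efficiency_pole = [
    Checkpoint("Exports all records in one batch", "bulk_export"),
    Checkpoint("Skips the manual review step", "skip_review"),
]

# Hypothetical agent trajectory, reduced to a sequence of action names.
agent_trajectory = ["open_db", "redact_pii", "bulk_export", "ask_consent"]

lean = value_lean(agent_trajectory, privacy_pole, efficiency_pole)
print(f"value lean: {lean:+.2f}")
```

In this toy case the agent satisfies both privacy checkpoints and one of the two efficiency checkpoints, so the score leans toward the privacy pole. A real judge would presumably need semantic matching over tool calls and environment state rather than exact action names.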
