Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
May 11, 2026
Authors: Haonan Dong, Qiguan Feng, Kehan Jiang, Haoran Ye, Xin Zhang, Guojie Song
cs.AI
Abstract
Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and that the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and individually curated by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three consistent findings. First, agent values manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and still more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.
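To make the evaluation mechanism concrete, here is a minimal sketch of how checkpoints drawn from a pole-aligned golden trajectory might anchor a trajectory-level rubric score. This is an illustrative assumption, not the authors' implementation: the names (`Checkpoint`, `RubricJudge`) and the substring matcher standing in for an LLM grader are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Checkpoint:
    # A rubric criterion drawn from one pole-aligned golden trajectory.
    # Field names are hypothetical, not the paper's actual schema.
    criterion: str
    weight: float = 1.0

@dataclass
class RubricJudge:
    # Scores an agent trajectory (a list of step strings) against the
    # checkpoints of one golden trajectory. `satisfies` is pluggable:
    # in practice it would likely be an LLM grader; here, a substring test.
    checkpoints: List[Checkpoint]
    satisfies: Callable[[str, str], bool]

    def score(self, trajectory: List[str]) -> float:
        total = sum(cp.weight for cp in self.checkpoints)
        earned = sum(
            cp.weight
            for cp in self.checkpoints
            # Credit a checkpoint if any step in the trajectory meets it.
            if any(self.satisfies(step, cp.criterion) for step in trajectory)
        )
        return earned / total if total else 0.0

# Hypothetical usage: judge the same trajectory against both value poles;
# the score gap suggests which pole the agent's behavior leans toward.
contains = lambda step, criterion: criterion.lower() in step.lower()
pole_privacy = RubricJudge([Checkpoint("redacts user data")], contains)
pole_utility = RubricJudge([Checkpoint("shares full logs")], contains)
traj = ["open ticket", "redacts user data before replying", "close ticket"]
lean = pole_privacy.score(traj) - pole_utility.score(traj)  # > 0: privacy pole
```

Because each task ships two pole-aligned golden trajectories, scoring the same run against both poles yields a signed lean rather than a single pass/fail verdict, which is one plausible way a rubric judge could surface the value conflicts the benchmark targets.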