

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

May 11, 2026
Authors: Haonan Dong, Qiguan Feng, Kehan Jiang, Haoran Ye, Xin Zhang, Guojie Song
cs.AI

Abstract

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.