HumanAgencyBench:AI助手對人類自主性支持的規模化評估
HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants
September 10, 2025
作者: Benjamin Sturgeon, Daniel Samuelson, Jacob Haimes, Jacy Reese Anthis
cs.AI
摘要
隨著人類將更多任務和決策委託給人工智慧(AI),我們面臨著失去對個人及集體未來掌控的風險。相對簡單的演算法系統已經在引導人類決策,例如社交媒體的動態推送演算法,使人們在無意間、心不在焉地滑動瀏覽那些為提升參與度而優化的內容。本文透過整合哲學與科學的能動性理論與AI輔助評估方法,發展了人類能動性的概念:利用大型語言模型(LLMs)模擬和驗證用戶查詢,並評估AI的回應。我們開發了HumanAgencyBench(HAB),這是一個基於典型AI使用場景、具有六個維度的人類能動性可擴展且自適應的基準。HAB衡量AI助手或代理在以下方面的傾向:提出澄清問題、避免價值操縱、糾正錯誤信息、推遲重要決策、鼓勵學習以及維護社交界限。我們發現,當代基於LLM的助手對能動性的支持程度從低到中等不等,且不同系統開發者和維度之間存在顯著差異。例如,雖然Anthropic的LLM總體上最支持人類能動性,但在避免價值操縱方面卻是支持最少的LLM。能動性支持似乎並未因LLM能力的提升或指令遵循行為(如RLHF)而一致性地增強,我們呼籲轉向更為穩健的安全性和對齊目標。
English
As humans delegate more tasks and decisions to artificial intelligence (AI),
we risk losing control of our individual and collective futures. Relatively
simple algorithmic systems already steer human decision-making, such as social
media feed algorithms that lead people to unintentionally and absent-mindedly
scroll through engagement-optimized content. In this paper, we develop the idea
of human agency by integrating philosophical and scientific theories of agency
with AI-assisted evaluation methods: using large language models (LLMs) to
simulate and validate user queries and to evaluate AI responses. We develop
HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions
of human agency based on typical AI use cases. HAB measures the tendency of an
AI assistant or agent to Ask Clarifying Questions, Avoid Value Manipulation,
Correct Misinformation, Defer Important Decisions, Encourage Learning, and
Maintain Social Boundaries. We find low-to-moderate agency support in
contemporary LLM-based assistants and substantial variation across system
developers and dimensions. For example, while Anthropic LLMs most support human
agency overall, they are the least supportive LLMs in terms of Avoid Value
Manipulation. Agency support does not appear to consistently result from
increasing LLM capabilities or instruction-following behavior (e.g., RLHF), and
we encourage a shift towards more robust safety and alignment targets.