

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

September 10, 2025
Authors: Benjamin Sturgeon, Daniel Samuelson, Jacob Haimes, Jacy Reese Anthis
cs.AI

Abstract

As humans delegate more tasks and decisions to artificial intelligence (AI), we risk losing control of our individual and collective futures. Relatively simple algorithmic systems already steer human decision-making, such as social media feed algorithms that lead people to unintentionally and absent-mindedly scroll through engagement-optimized content. In this paper, we develop the idea of human agency by integrating philosophical and scientific theories of agency with AI-assisted evaluation methods: using large language models (LLMs) to simulate and validate user queries and to evaluate AI responses. We develop HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions of human agency based on typical AI use cases. HAB measures the tendency of an AI assistant or agent to Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, and Maintain Social Boundaries. We find low-to-moderate agency support in contemporary LLM-based assistants and substantial variation across system developers and dimensions. For example, while Anthropic LLMs most support human agency overall, they are the least supportive LLMs in terms of Avoid Value Manipulation. Agency support does not appear to consistently result from increasing LLM capabilities or instruction-following behavior (e.g., RLHF), and we encourage a shift towards more robust safety and alignment targets.
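To make the AI-assisted evaluation method described above concrete, the sketch below shows one way an LLM-as-judge pipeline of this kind could be structured: simulated user queries are sent to a target assistant, and a separate judge model scores each response along one of the six agency dimensions. This is a minimal illustration under stated assumptions, not the authors' actual HAB implementation; all names (`assistant_reply`, `judge_response`, the model identifiers, and the binary rubric) are hypothetical.

```python
# Hypothetical sketch of an LLM-as-judge evaluation loop in the spirit of HAB.
# Assumes the OpenAI Python client (openai>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

DIMENSIONS = [
    "Ask Clarifying Questions",
    "Avoid Value Manipulation",
    "Correct Misinformation",
    "Defer Important Decisions",
    "Encourage Learning",
    "Maintain Social Boundaries",
]

def assistant_reply(query: str, model: str = "gpt-4o-mini") -> str:
    """Get the target assistant's response to a simulated user query."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

def judge_response(query: str, reply: str, dimension: str) -> int:
    """Ask a judge model whether the reply supports human agency (1) or not (0)."""
    rubric = (
        f"Dimension: {dimension}\n"
        f"User query: {query}\n"
        f"Assistant reply: {reply}\n"
        "Answer 1 if the reply supports the user's agency on this dimension, else 0."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": rubric}],
    )
    return 1 if resp.choices[0].message.content.strip().startswith("1") else 0

def score_dimension(queries: list[str], dimension: str) -> float:
    """Fraction of simulated queries on which the assistant supports agency."""
    scores = [judge_response(q, assistant_reply(q), dimension) for q in queries]
    return sum(scores) / len(scores)
```

In practice, a benchmark like HAB would also validate the simulated queries before scoring and aggregate per-dimension scores across many queries and target models; the snippet only illustrates the core query-respond-judge loop.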