HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants
September 10, 2025
Authors: Benjamin Sturgeon, Daniel Samuelson, Jacob Haimes, Jacy Reese Anthis
cs.AI
Abstract
As humans delegate more tasks and decisions to artificial intelligence (AI),
we risk losing control of our individual and collective futures. Relatively
simple algorithmic systems already steer human decision-making, such as social
media feed algorithms that lead people to unintentionally and absent-mindedly
scroll through engagement-optimized content. In this paper, we develop the idea
of human agency by integrating philosophical and scientific theories of agency
with AI-assisted evaluation methods: using large language models (LLMs) to
simulate and validate user queries and to evaluate AI responses. We develop
HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions
of human agency based on typical AI use cases. HAB measures the tendency of an
AI assistant or agent to Ask Clarifying Questions, Avoid Value Manipulation,
Correct Misinformation, Defer Important Decisions, Encourage Learning, and
Maintain Social Boundaries. We find low-to-moderate agency support in
contemporary LLM-based assistants and substantial variation across system
developers and dimensions. For example, while Anthropic LLMs most support human
agency overall, they are the least supportive LLMs in terms of Avoid Value
Manipulation. Agency support does not appear to consistently result from
increasing LLM capabilities or instruction-following behavior (e.g., RLHF), and
we encourage a shift towards more robust safety and alignment targets.
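
The evaluation pipeline the abstract describes (an LLM simulating user queries, the assistant under test responding, and an evaluator LLM scoring each response against a per-dimension rubric) can be sketched roughly as below. This is a minimal illustration, not the paper's actual implementation: the `call_llm` stub, the prompt wording, and the rubric for the Ask Clarifying Questions dimension are all invented for exposition.

```python
# Sketch of an LLM-assisted agency benchmark loop: simulate queries,
# collect assistant responses, score them with an evaluator model.
# All names and prompts here are illustrative assumptions.

from statistics import mean

def call_llm(prompt: str) -> str:
    """Stub for an LLM API call; swap in a real client (assumption)."""
    raise NotImplementedError

# Hypothetical binary rubric for one dimension (Ask Clarifying Questions).
RUBRIC = (
    "Score 1 if the assistant asks a clarifying question before acting "
    "on the ambiguous request; score 0 otherwise. Reply with one digit."
)

def simulate_queries(dimension: str, n: int) -> list[str]:
    # Simulator LLM generates n candidate user queries for the dimension.
    return [
        call_llm(
            f"Write one realistic, ambiguous user request that tests the "
            f"'{dimension}' dimension of human agency support."
        )
        for _ in range(n)
    ]

def score_response(query: str, response: str) -> int:
    # Evaluator LLM judges the response against the rubric.
    verdict = call_llm(
        f"Rubric: {RUBRIC}\n\nUser query: {query}\n\n"
        f"Assistant response: {response}"
    )
    return 1 if verdict.strip().startswith("1") else 0

def evaluate_assistant(assistant, dimension: str, n: int = 100) -> float:
    # `assistant` is a callable mapping a query string to a response string.
    queries = simulate_queries(dimension, n)
    scores = [score_response(q, assistant(q)) for q in queries]
    return mean(scores)  # fraction of responses supporting agency
```

Under this framing, a dimension score is simply the fraction of simulated queries for which the evaluator judges the response agency-supportive, which is one plausible way to make such a benchmark both scalable (queries are generated, not hand-written) and adaptive (the rubric and simulator prompt can be revised per dimension).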