HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

Abstract

As humans delegate more tasks and decisions to artificial intelligence (AI), we risk losing control of our individual and collective futures. Relatively simple algorithmic systems already steer human decision-making, such as social media feed algorithms that lead people to unintentionally and absent-mindedly scroll through engagement-optimized content. In this paper, we develop the idea of human agency by integrating philosophical and scientific theories of agency with AI-assisted evaluation methods: using large language models (LLMs) to simulate and validate user queries and to evaluate AI responses. We develop HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions of human agency based on typical AI use cases. HAB measures the tendency of an AI assistant or agent to Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, and Maintain Social Boundaries. We find low-to-moderate agency support in contemporary LLM-based assistants and substantial variation across system developers and dimensions. For example, while Anthropic LLMs are the most supportive of human agency overall, they are the least supportive LLMs in terms of Avoid Value Manipulation. Agency support does not appear to consistently result from increasing LLM capabilities or instruction-following behavior (e.g., RLHF), and we encourage a shift towards more robust safety and alignment targets.
