Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Abstract

Autonomous agents have rapidly matured as task executors and are now widely deployed via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values that silently steer agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and that the agentic modality introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level, rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three consistent findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together, these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.
