ClawArena: Benchmarking AI Agents in Evolving Information Environments

Abstract

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.
