Implicit Intelligence -- Evaluating Agents on What Users Don't Say

Abstract

Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. We present Implicit Intelligence, an evaluation framework testing whether AI agents can move beyond prompt-following to become genuine goal-fulfillers, paired with Agent-as-a-World (AaW), a harness where interactive worlds are defined in human-readable YAML files and simulated by language models. Our scenarios feature apparent simplicity in user requests, hidden complexity in correct solutions, and discoverability of constraints through environmental exploration. Evaluating 16 frontier and open-weight models across 205 scenarios, we find that even the best-performing model achieves a scenario pass rate of only 48.3%, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.
