Implicit Intelligence -- Evaluating Agents on What Users Don't Say
February 23, 2026
Authors: Ved Sirdeshmukh, Marc Wetter
cs.AI
Abstract
Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. We present Implicit Intelligence, an evaluation framework testing whether AI agents can move beyond prompt-following to become genuine goal-fulfillers, paired with Agent-as-a-World (AaW), a harness where interactive worlds are defined in human-readable YAML files and simulated by language models. Our scenarios feature apparent simplicity in user requests, hidden complexity in correct solutions, and discoverability of constraints through environmental exploration. Evaluating 16 frontier and open-weight models across 205 scenarios, we find that even the best-performing model achieves only 48.3% scenario pass rate, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.
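The abstract states that AaW worlds are defined in human-readable YAML and simulated by language models, but does not show the schema. As a purely hypothetical sketch (all field names are illustrative assumptions, not the benchmark's actual format), a scenario with an apparently simple request, hidden constraints, and environmentally discoverable state might look like:

```yaml
# Hypothetical AaW-style scenario -- illustrative schema only,
# not the benchmark's published format.
scenario:
  id: shared-printer-001
  user_request: "Print my slides for the 2pm meeting."   # apparently simple
  hidden_constraints:
    # a correct solution must satisfy requirements the user never states
    - "One attendee relies on a screen reader (accessibility)."
    - "The shared print queue is visible to other teams (privacy)."
  world:
    # state the simulating language model reveals only through exploration
    objects:
      - name: printer
        state: "shared queue, low toner"
      - name: calendar
        state: "2pm meeting includes external guests"
  pass_criteria:
    - "Agent inspects context before acting"
    - "No confidential material left in the shared queue"
```

The point of such a layout, if it resembles the real harness, is that the user request alone never suffices: the grading criteria reference constraints recoverable only by querying the simulated world.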