OdysseyArena: Benchmarking Large Language Models for Long-Horizon, Active, and Inductive Interactions

Abstract

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive need for agents to autonomously discover latent transition laws from experience, a capability that is the cornerstone of agentic foresight and sustained strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models remain deficient in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
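To make the deductive/inductive distinction concrete, the following is a minimal toy sketch, not the OdysseyArena API. Every name in it (`HiddenRuleEnv`, `plan`, `induce_and_solve`) is hypothetical, and the modular-offset dynamics are an assumption chosen only for illustration. The agent is never told the transition rule: it probes each action once, induces the rule from the observed transitions, and then plans over its induced model.

```python
from collections import deque


class HiddenRuleEnv:
    """Toy environment (hypothetical): the state is an integer mod `modulus`,
    and each action adds an offset that is hidden from the agent."""

    def __init__(self, offsets, modulus=10, start=0, goal=5):
        self._offsets = dict(offsets)  # latent transition law, not exposed
        self.modulus = modulus
        self.state = start
        self.goal = goal

    def step(self, action):
        """Apply the hidden rule and return (new_state, goal_reached)."""
        self.state = (self.state + self._offsets[action]) % self.modulus
        return self.state, self.state == self.goal


def plan(start, goal, learned, modulus):
    """Breadth-first search over the induced model; returns an action list or None."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for action, offset in learned.items():
            nxt = (state + offset) % modulus
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


def induce_and_solve(env, actions):
    """Probe each action once to infer its offset, then act on the induced model.
    Returns (steps_taken, learned_offsets, solved)."""
    learned, steps, state = {}, 0, env.state
    # Active exploration: observe one transition per action.
    for action in actions:
        nxt, done = env.step(action)
        steps += 1
        learned[action] = (nxt - state) % env.modulus
        state = nxt
        if done:
            return steps, learned, True
    # Exploitation: plan with the induced rule, then execute the plan.
    path = plan(state, env.goal, learned, env.modulus)
    if path is None:
        return steps, learned, False
    done = False
    for action in path:
        _, done = env.step(action)
        steps += 1
    return steps, learned, done


env = HiddenRuleEnv({"A": 3, "B": 4})
steps, learned, solved = induce_and_solve(env, ["A", "B"])
print(steps, learned, solved)  # 4 {'A': 3, 'B': 4} True
```

In a deductive setup the offsets would be given in the prompt and only the planning half would be exercised; here the exploration steps count toward the total, which is one simple way to think about inductive efficiency.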