OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Abstract

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, in which agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the need for agents to autonomously induce latent transition laws from experience, a capability that is the cornerstone of agentic foresight and sustained strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building on this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., more than 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models remain clearly deficient in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
