InnoGym: Benchmarking the Innovation Potential of AI Agents
December 1, 2025
Authors: Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang
cs.AI
Abstract
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
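As a rough illustration only (the abstract does not give the paper's actual definitions), the two metrics could be formalized along these lines, where $s_{\text{agent}}$ is the agent solution's task score, $s_{\text{best}}$ is the score of the best-known prior solution, $m_{\text{agent}}$ is a representation of the agent's method, $\mathcal{M}_{\text{prior}}$ is the set of prior methods, and $d(\cdot,\cdot)$ is some methodological distance; all of these symbols are assumptions for the sketch:

\[
\text{PerformanceGain} \;=\; \frac{s_{\text{agent}} - s_{\text{best}}}{\lvert s_{\text{best}} \rvert},
\qquad
\text{Novelty} \;=\; \min_{m \in \mathcal{M}_{\text{prior}}} d\bigl(m_{\text{agent}},\, m\bigr).
\]

Under this reading, an agent can score high on novelty (its method is far from every prior one) while still showing little or no performance gain, which matches the gap between creativity and effectiveness highlighted above.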