InnoGym: Benchmarking the Innovation Potential of AI Agents

December 1, 2025
Authors: Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang
cs.AI

Abstract

LLMs and agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure the correctness of outcomes, overlooking the diversity of the methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible, long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits their performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
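The abstract does not spell out how the two metrics are computed. Purely as an illustration of the idea, a relative performance-gain score and a dissimilarity-based novelty score might look like the following minimal sketch; the function names, the feature-set representation of solutions, and the Jaccard-based formula are all assumptions, not the paper's actual definitions.

```python
# Illustrative sketch only. The abstract does not define the metrics
# formally, so the formulas below are assumptions for intuition.

def performance_gain(agent_score: float, best_known_score: float) -> float:
    """Assumed form: relative improvement over the best-known solution."""
    return (agent_score - best_known_score) / abs(best_known_score)

def novelty(solution: set[str], prior_solutions: list[set[str]]) -> float:
    """Assumed form: 1 minus the highest Jaccard similarity between the
    agent's solution and any prior approach, with solutions represented
    as sets of methodological features."""
    def jaccard(a: set[str], b: set[str]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    if not prior_solutions:
        return 1.0  # nothing to compare against: maximally novel
    return 1.0 - max(jaccard(solution, p) for p in prior_solutions)
```

Under these assumed definitions, an agent that merely matches the best-known score has zero performance gain, and a solution sharing no methodological features with any prior approach has novelty 1.0, which mirrors the abstract's point that correctness and originality are separate axes.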