InnoGym: Benchmarking the Innovation Potential of AI Agents

Abstract

LLMs and agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness and overlook the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.

