AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Abstract

Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how, and how well, they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of the space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory, and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
