VideoGameBunny: Towards vision assistants for video games

Abstract

Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications such as medical diagnostics. However, their capabilities are limited in the video game domain: they struggle with scene understanding, hallucinate, and describe video game content inaccurately, especially in the case of open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny and tailored specifically for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements for 136,974 images. Our experiments show that high-quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b, which has more than 4x as many parameters. Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/
