VideoGameBunny: Towards vision assistants for video games
July 21, 2024
Authors: Mohammad Reza Taesiri, Cor-Paul Bezemer
cs.AI
Abstract
Large multimodal models (LMMs) hold substantial promise across various
domains, from personal assistance in daily tasks to sophisticated applications
like medical diagnostics. However, their capabilities have limitations in the
video game domain, such as challenges with scene understanding, hallucinations,
and inaccurate descriptions of video game content, especially in open-source
models. This paper describes the development of VideoGameBunny, a LLaVA-style
model based on Bunny, specifically tailored for understanding images from video
games. We release intermediate checkpoints, training logs, and an extensive
dataset comprising 185,259 video game images from 413 titles, along with
389,565 image-instruction pairs that include image captions, question-answer
pairs, and a JSON representation of 16 elements of 136,974 images. Our
experiments show that our high-quality game-related data has the potential to
make a relatively small model outperform the much larger state-of-the-art model
LLaVa-1.6-34b (which has more than 4x the number of parameters). Our study
paves the way for future research in video game understanding on tasks such as
playing, commentary, and debugging. Code and data are available at
https://videogamebunny.github.io/
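For illustration, an image-instruction pair in a dataset like the one described (captions, question-answer pairs, and structured JSON descriptions) might look like the sketch below. The field names and values here are assumptions for illustration only, not the released dataset's actual schema:

```python
# Hypothetical structure of one image-instruction pair.
# All field names and values are illustrative assumptions,
# not the actual schema of the VideoGameBunny dataset.
example_pair = {
    "image": "images/game_0001/frame_0042.png",   # path to a game screenshot
    "instruction_type": "question_answer",        # e.g. caption, QA, or JSON description
    "question": "What objects are visible in the foreground?",
    "answer": "A wooden crate and a health potion on a stone floor.",
}

def validate_pair(pair: dict) -> bool:
    """Check that a record carries the minimal fields an instruction-tuning
    pipeline would need to pair an image with its instruction."""
    required = {"image", "instruction_type"}
    return required.issubset(pair)

print(validate_pair(example_pair))  # True
```

A loader for such data would typically group records by `instruction_type` so that captioning, QA, and structured-description examples can be sampled at controlled ratios during fine-tuning.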