VideoGameBunny: Towards vision assistants for video games

July 21, 2024
作者: Mohammad Reza Taesiri, Cor-Paul Bezemer
cs.AI

Abstract

Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements for 136,974 images. Our experiments show that our high-quality, game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/
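To make the dataset composition described above concrete, the following is a minimal sketch of what a single image-instruction record might look like. All field names, paths, and values here are illustrative assumptions for exposition; they are not the released dataset's actual schema.

```python
# Hypothetical sketch of one image-instruction record; field names and
# values are illustrative assumptions, not the dataset's actual schema.
import json

record = {
    "image": "frames/example_game/frame_000123.png",  # path to a game screenshot
    "game_title": "ExampleGame",                      # one of the 413 titles
    # Each image can carry several instruction types: captions and QA pairs,
    # as the abstract describes.
    "instructions": [
        {"type": "caption",
         "text": "A knight stands on a stone bridge at dusk."},
        {"type": "qa",
         "question": "What is the character holding?",
         "answer": "A sword."},
    ],
}

# Serialize the record, as a JSON-lines dataset might store it.
line = json.dumps(record)
restored = json.loads(line)
print(restored["instructions"][0]["type"])
```

A structure like this keeps all annotations for one image in a single record, which makes it straightforward to filter by instruction type when building training batches.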
