VideoGameBunny: Towards vision assistants for video games

Abstract

Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities are limited in the video game domain: they struggle with scene understanding, hallucinate, and describe video game content inaccurately, especially in the case of open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements of 136,974 images. Our experiments show that our high-quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/
