TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Abstract

Improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has recently made great progress. However, most existing work relies on reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring the reasoning capabilities of small-scale models remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. We therefore present TinyLLaVA-Video-R1, a small-scale video reasoning model. It is based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters. After reinforcement learning on general Video-QA datasets, it not only shows significantly improved reasoning and thinking capabilities but also exhibits the emergent characteristic of "aha moments". We also share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. The model is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.
