Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Abstract

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite on scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains level off, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content as opposed to meta-commentary. Thought-Final Coverage measures how faithfully the thought stream is reflected in the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most of the improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content to the final output that it never reasoned about, a form of compression-step hallucination. Although they are different model tiers, Flash and Flash Lite produce similar thought streams that differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
