Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding
April 13, 2026
Authors: Shivam Sharma, Sankalp Nagaonkar, Ashish Choithani, Ashutosh Trivedi
cs.AI
Abstract
We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
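To make the Thought-Final Coverage idea concrete, the sketch below approximates it with simple lexical overlap: for each sentence of the final output, it checks what fraction of that sentence's tokens appear anywhere in the thought stream, and counts the sentence as "covered" if that recall clears a threshold. This is an illustrative stand-in only; the paper's actual metric is scored by GPT-5 as a judge, and the function name, threshold, and tokenization here are assumptions, not the authors' implementation.

```python
def thought_final_coverage(thought_stream: str, final_output: str,
                           threshold: float = 0.5) -> float:
    """Hypothetical lexical proxy for Thought-Final Coverage.

    Returns the fraction of final-output sentences whose tokens mostly
    already appear in the thought stream. Sentences below the recall
    threshold are treated as content the model never reasoned about
    (a signal of compression-step hallucination).
    """
    thought_tokens = set(thought_stream.lower().split())
    sentences = [s.strip() for s in final_output.split(".") if s.strip()]
    if not sentences:
        return 0.0

    def recall(sentence: str) -> float:
        toks = sentence.lower().split()
        return sum(t in thought_tokens for t in toks) / len(toks)

    covered = sum(recall(s) >= threshold for s in sentences)
    return covered / len(sentences)


# Example: the second sentence introduces details absent from the
# thought stream, so only half of the output is covered.
thought = "a man walks a dog in the park at dusk"
final = "A man walks a dog. It wears a shiny red collar."
print(thought_final_coverage(thought, final))  # -> 0.5
```

A judge-based version would replace `recall` with an entailment check per sentence, but the aggregation (covered sentences over total sentences) stays the same.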