
Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

April 13, 2026
Authors: Shivam Sharma, Sankalp Nagaonkar, Ashish Choithani, Ashutosh Trivedi
cs.AI

Abstract

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
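The abstract names the metrics but gives no formulas, and the paper relies on GPT-5 as the judge. As a rough illustration only, the sketch below shows one way the ratios behind Contentfulness and Thought-Final Coverage could be computed once a judge has labeled the data. The `Segment` type, the `is_scene_content` label, the word-count weighting, and the set-based item matching are all assumptions made for this example, not the authors' definitions.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One span of a thought stream (hypothetical structure)."""
    text: str
    is_scene_content: bool  # assumed label, e.g. assigned by a judge model


def contentfulness(segments: list[Segment]) -> float:
    """Share of thought-stream words that describe the scene
    rather than meta-commentary (assumed word-count weighting)."""
    total = sum(len(s.text.split()) for s in segments)
    if total == 0:
        return 0.0
    content = sum(len(s.text.split()) for s in segments if s.is_scene_content)
    return content / total


def thought_final_coverage(thought_items: set[str], final_items: set[str]) -> float:
    """Share of items mentioned in the thought stream that also
    appear in the final output (assumed exact-match comparison)."""
    if not thought_items:
        return 0.0
    return len(thought_items & final_items) / len(thought_items)


def unreasoned_items(thought_items: set[str], final_items: set[str]) -> set[str]:
    """Items in the final output that never appeared in the thought
    stream: a crude proxy for compression-step hallucination."""
    return final_items - thought_items


if __name__ == "__main__":
    stream = [
        Segment("A man walks a dog in a park", True),
        Segment("Let me re-read the instructions before I answer", False),
    ]
    thought = {"man", "dog", "park", "walking"}
    final = {"man", "dog", "park", "leash"}
    print(contentfulness(stream))
    print(thought_final_coverage(thought, final))
    print(unreasoned_items(thought, final))
```

In this toy example the final output mentions a leash that the thought stream never discussed, which is the kind of unreasoned addition the abstract attributes to tight reasoning budgets. A real pipeline would need a judge model or semantic matching in place of exact string sets.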