StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

December 1, 2025
作者: Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal
cs.AI

Abstract

Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
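As background on the QA-generation pipeline described above, fixation extraction is the step that segments a raw gaze trajectory into stable fixations before scanpath construction. The sketch below illustrates one common way to do this, a simplified dispersion-based detector in the spirit of I-DT; the function name, thresholds, and data layout are illustrative assumptions, not details taken from the StreamGaze pipeline.

```python
# Minimal sketch of dispersion-based fixation extraction (I-DT style).
# Thresholds and input format are assumptions for illustration only.
import numpy as np

def extract_fixations(gaze, timestamps, dispersion_thresh=0.05, min_duration=0.1):
    """Segment a gaze trajectory into fixations.

    gaze: (N, 2) array of normalized (x, y) gaze points.
    timestamps: (N,) array of times in seconds.
    Returns a list of (start_time, end_time, centroid) tuples.
    """
    fixations = []
    start = 0
    while start < len(gaze):
        end = start
        # Grow the window while its spatial dispersion stays below the threshold.
        while end + 1 < len(gaze):
            window = gaze[start:end + 2]
            dispersion = (window[:, 0].max() - window[:, 0].min()) + \
                         (window[:, 1].max() - window[:, 1].min())
            if dispersion > dispersion_thresh:
                break
            end += 1
        # Keep the window as a fixation only if it lasts long enough.
        duration = timestamps[end] - timestamps[start]
        if duration >= min_duration:
            centroid = gaze[start:end + 1].mean(axis=0)
            fixations.append((timestamps[start], timestamps[end], centroid))
        start = end + 1
    return fixations
```

Each returned centroid can then serve as an anchor for region-specific visual prompting, and the ordered list of fixations forms a scanpath over the observed frames.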