Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Abstract

Multimodal large language models (MLLMs) are rapidly transforming video understanding, as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. We organize representative methods by the role they play in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering covers offline and streaming memory. Reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and survey training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related work is tracked and continuously updated at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.