Watching, Remembering, and Reasoning: Video Understanding with MLLMs from a Human Perspective

Abstract

Multimodal large language models (MLLMs) are rapidly transforming video understanding, as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, and multimodal alignment, and to reason reliably under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified framework for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by the role they play in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering covers offline and streaming memory. Reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and survey training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related work is tracked and continuously updated at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.