Watching, Remembering, and Reasoning: Human-View Video Understanding with MLLMs

Abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related work will be tracked on an ongoing basis at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.