Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
February 9, 2026
Authors: Yuhao Dong, Shulin Tian, Shuai Liu, Shuangrui Ding, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Ziwei Liu
cs.AI
Abstract
Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge rather than their ability to learn and adapt from a few examples in dynamic, novel contexts. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1,200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) text demonstrations, generated by summarizing video subtitles; and (ii) video demonstrations, the corresponding instructional videos themselves. To tackle this new challenge, we develop Demo-ICL, an MLLM trained with a two-stage strategy: video-supervised fine-tuning followed by information-assisted direct preference optimization, which jointly enhance the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
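The second training stage builds on direct preference optimization (DPO). The abstract does not specify the "information-assisted" variant, so the sketch below shows only the standard DPO objective it presumably extends: the policy is pushed to widen the log-likelihood margin between a preferred and a rejected response relative to a frozen reference model. All function and argument names here are illustrative, not from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_*      : policy log-likelihoods of the chosen/rejected response
    ref_logp_*  : frozen reference model's log-likelihoods of the same
    beta        : temperature controlling deviation from the reference

    Returns -log(sigmoid(beta * margin)), where the margin is the
    policy's preference gap minus the reference's preference gap.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At a zero margin the loss is log(2); it shrinks as the policy
# prefers the chosen response more strongly than the reference does.
```

In the paper's setting, the chosen/rejected pairs would plausibly contrast answers grounded in the in-context demonstration against answers relying only on the model's internal knowledge, though the abstract does not detail the pair construction.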