

VIOLA: Towards Video In-Context Learning with Minimal Annotations

January 22, 2026
Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
cs.AI

Abstract

Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require expert annotation. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
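The abstract names two concrete mechanisms: density-uncertainty-weighted sampling for spending the annotation budget, and confidence-aware retrieval over a hybrid pool of expert labels and pseudo-labels. The sketch below is a minimal, hypothetical illustration of how such scores could be combined, not the authors' released implementation. It assumes precomputed video embeddings, per-sample predictive uncertainties, and pseudo-label confidences; the kernel density estimator, the weighted-sum composite with weight alpha, and all function names are assumptions made for illustration, and any explicit diversity step is omitted for brevity.

```python
# Illustrative sketch only (assumed design, not the paper's code).
import numpy as np
from sklearn.neighbors import KernelDensity


def select_for_annotation(embeddings, uncertainty, budget, bandwidth=1.0):
    """Pick `budget` samples that are representative (high density) and
    informative (high uncertainty); visual outliers get low density and are
    down-weighted. A greedy diversity step is omitted in this sketch."""
    kde = KernelDensity(bandwidth=bandwidth).fit(embeddings)
    log_density = kde.score_samples(embeddings)
    # Min-max normalize both signals before combining them.
    density = (log_density - log_density.min()) / (
        log_density.max() - log_density.min() + 1e-8
    )
    unc = (uncertainty - uncertainty.min()) / (
        uncertainty.max() - uncertainty.min() + 1e-8
    )
    score = density * unc  # density-uncertainty weighting
    return np.argsort(-score)[:budget]


def confidence_aware_retrieve(query_emb, pool_embs, pool_conf, k=4, alpha=0.7):
    """Rank demonstrations from the hybrid pool by a composite of cosine
    similarity to the query and label confidence (e.g., 1.0 for verified
    expert labels, model confidence for pseudo-labels)."""
    sim = pool_embs @ query_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    composite = alpha * sim + (1.0 - alpha) * pool_conf
    return np.argsort(-composite)[:k]
```

In this reading, the retrieved demonstrations would then be placed in the prompt together with a reliability marker per example, which is one way the described confidence-aware prompting could let the MLLM weight verified ground truths differently from noisy pseudo-labels.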