

VIOLA: Towards Video In-Context Learning with Minimal Annotations

January 22, 2026
Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
cs.AI

Abstract

Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments such as industrial or surgical settings, since annotation requires domain experts. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that combines minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without propagating noise, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability: demonstrations are retrieved using a composite score of similarity and confidence, and the MLLM is prompted to adaptively distinguish verified ground-truth labels from noisy pseudo-labels. Extensive experiments across nine diverse benchmarks with four MLLMs demonstrate that our framework significantly outperforms a range of baselines in low-resource settings, achieving robust adaptation at minimal annotation cost.
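
The following is a minimal Python sketch of the two selection signals the abstract describes, not the authors' released implementation. It assumes precomputed video-clip embeddings, a Gaussian KDE as the density estimator, prediction entropy as the uncertainty measure, and a mixing weight alpha for the composite retrieval score; all function names and parameters are illustrative.

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.metrics.pairwise import cosine_similarity


def density_uncertainty_scores(embeddings, class_probs, bandwidth=0.5):
    # Density (Gaussian KDE over the embeddings) rewards representative clips rather
    # than visual outliers; entropy of the model's class probabilities rewards
    # informative ones. The annotation budget goes to the top-scoring clips.
    kde = KernelDensity(bandwidth=bandwidth).fit(embeddings)
    density = np.exp(kde.score_samples(embeddings))
    uncertainty = -np.sum(class_probs * np.log(class_probs + 1e-12), axis=1)
    return density * uncertainty


def confidence_aware_retrieval(query_emb, pool_embs, pool_conf, k=4, alpha=0.7):
    # Composite score: embedding similarity mixed with label confidence, so noisy
    # pseudo-labels are down-weighted relative to expert-verified labels.
    sim = cosine_similarity(query_emb[None, :], pool_embs)[0]
    score = alpha * sim + (1.0 - alpha) * pool_conf
    return np.argsort(-score)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(100, 32))              # stand-in video embeddings
    probs = rng.dirichlet(np.ones(5), size=100)    # stand-in MLLM class probabilities

    # Spend the annotation budget on the top density-uncertainty-weighted clips.
    budget_idx = np.argsort(-density_uncertainty_scores(embs, probs))[:8]

    # Hybrid pool: expert labels get confidence 1.0, pseudo-labels keep model confidence.
    conf = rng.uniform(0.5, 1.0, size=100)
    conf[budget_idx] = 1.0
    demos = confidence_aware_retrieval(embs[0], embs, conf, k=4)
    print("annotate:", budget_idx, "retrieve:", demos)

Confidence-aware prompting, the third component, operates at the prompt level (marking retrieved demonstrations as verified or pseudo-labeled for the MLLM) and is not captured by this retrieval sketch.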