PVChat: Personalized Video Chat with One-Shot Learning
March 21, 2025
Authors: Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yuchen Li, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo
cs.AI
Abstract
Video large language models (ViLLMs) excel in general video understanding,
e.g., recognizing activities like talking and eating, but struggle with
identity-aware comprehension, such as "Wilson is receiving chemotherapy" or
"Tom is discussing with Sarah", limiting their applicability in smart
healthcare and smart home environments. To address this limitation, we propose
a one-shot learning framework PVChat, the first personalized ViLLM that enables
subject-aware question answering (QA) from a single video for each subject. Our
approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically
augmented video-QA dataset, leveraging a progressive image-to-video learning
strategy. Specifically, we introduce an automated augmentation pipeline that
synthesizes identity-preserving positive samples and retrieves hard negatives
from existing video corpora, generating a diverse training dataset with four QA
types: existence, appearance, action, and location inquiries. To enhance
subject-specific learning, we propose a ReLU Routing MoH attention mechanism,
alongside two novel objectives: (1) Smooth Proximity Regularization for
progressive learning through exponential distance scaling and (2) Head
Activation Enhancement for balanced attention routing. Finally, we adopt a
two-stage training strategy, transitioning from image pre-training to video
fine-tuning, enabling a gradual learning process from static attributes to
dynamic representations. We evaluate PVChat on diverse datasets covering
medical scenarios, TV series, anime, and real-world footage, demonstrating its
superiority in personalized feature understanding after learning from a single
video, compared to state-of-the-art ViLLMs.
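
To make the data side of the abstract concrete, the sketch below shows what one synthetically augmented training record covering the four question types (existence, appearance, action, location) could look like. The field names, the `<subject>` identity placeholder, the clip path, and the sample answers are illustrative assumptions, not the paper's released schema.

```python
# Hypothetical layout of one video-QA training record produced by the
# augmentation pipeline described in the abstract. All names and values
# below are illustrative only.
qa_record = {
    "subject_token": "<subject>",            # personalized identity placeholder
    "video_path": "clips/example_0001.mp4",  # hypothetical clip path
    "is_positive": True,                     # False for retrieved hard negatives
    "qa_pairs": [
        {"type": "existence",  "question": "Does <subject> appear in this video?",
         "answer": "Yes, <subject> appears throughout the clip."},
        {"type": "appearance", "question": "What is <subject> wearing?",
         "answer": "<subject> is wearing a blue jacket."},
        {"type": "action",     "question": "What is <subject> doing?",
         "answer": "<subject> is talking with another person."},
        {"type": "location",   "question": "Where is <subject> in the scene?",
         "answer": "<subject> is seated on the left side of the frame."},
    ],
}
```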
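For a concrete picture of the architectural pieces named in the abstract, the following is a minimal PyTorch sketch of a ReLU-routed Mixture-of-Heads attention layer together with plausible forms of the two auxiliary objectives. The abstract gives no formulas, so the router design, the exponential form of the Smooth Proximity Regularization term, and the variance-based Head Activation Enhancement term below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLURoutedMoHAttention(nn.Module):
    """Sketch of multi-head self-attention whose heads are gated per token
    by a ReLU router, so each token activates only a sparse subset of heads."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)  # one gating logit per head
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                    # (B, H, N, d)

        # ReLU routing: heads with non-positive logits are switched off.
        gates = F.relu(self.router(x))                    # (B, N, H)
        out = out * gates.permute(0, 2, 1).unsqueeze(-1)  # gate each head
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out), gates


def auxiliary_losses(gates: torch.Tensor, ref_gates: torch.Tensor):
    """Hypothetical forms of the two objectives named in the abstract."""
    # Smooth Proximity Regularization (assumed form): keep the routing for a
    # new subject close to a reference pattern, mapping the squared distance
    # through an exponential so the penalty grows smoothly and stays bounded.
    dist = (gates - ref_gates).pow(2).mean()
    spr = 1.0 - torch.exp(-dist)

    # Head Activation Enhancement (assumed form): encourage balanced routing
    # by penalising the variance of per-head activation frequencies.
    head_usage = (gates > 0).float().mean(dim=(0, 1))     # (H,)
    hae = head_usage.var()
    return spr, hae
```

In a full training setup these terms would be weighted and added to the QA loss during the two-stage image-to-video schedule; the weights and schedule are not specified in the abstract.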