

PVChat: Personalized Video Chat with One-Shot Learning

March 21, 2025
Authors: Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yuchen Li, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo
cs.AI

Abstract

Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs.
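The abstract only names the ReLU Routing MoH attention and its two companion objectives at a high level, without formulas. Below is a minimal, hypothetical sketch of how such a ReLU-routed Mixture-of-Heads attention layer and the two regularizers could look in PyTorch; the class and function names (ReLURoutedMoHAttention, smooth_proximity_reg, head_activation_reg) and all formulations are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLURoutedMoHAttention(nn.Module):
    """Sketch of a ReLU-routed Mixture-of-Heads (MoH) self-attention layer.

    Each token is routed to a dynamic subset of heads: a linear router
    produces one score per head, ReLU zeroes out inactive heads, and the
    surviving scores weight the corresponding head outputs.
    (Hypothetical layout; the paper's exact design may differ.)
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)  # per-token, per-head routing scores
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v              # (B, heads, N, head_dim)

        gates = F.relu(self.router(x))              # (B, N, heads); zero = head inactive
        out = out.permute(0, 2, 1, 3) * gates.unsqueeze(-1)
        out = out.reshape(B, N, C)
        return self.proj(out), gates


def smooth_proximity_reg(gates: torch.Tensor, step: int, total_steps: int,
                         tau: float = 5.0) -> torch.Tensor:
    """Toy Smooth Proximity Regularization: an exponentially decaying
    penalty (exponential distance scaling) that constrains routing early
    in training and relaxes it as training progresses."""
    scale = torch.exp(torch.tensor(-tau * step / total_steps))
    return scale * gates.pow(2).mean()


def head_activation_reg(gates: torch.Tensor) -> torch.Tensor:
    """Toy Head Activation Enhancement: encourage balanced attention routing
    by penalizing deviation of per-head activation rates from their mean."""
    usage = (gates > 0).float().mean(dim=(0, 1))    # fraction of tokens using each head
    return ((usage - usage.mean()) ** 2).mean()
```

In this reading, the ReLU gate lets the number of active heads per token vary freely (unlike fixed top-k routing), while the two auxiliary terms keep routing stable early on and prevent a few heads from dominating; how closely this matches PVChat's actual objectives is an assumption.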

