

PVChat: Personalized Video Chat with One-Shot Learning

March 21, 2025
Authors: Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yuchen Li, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo
cs.AI

Abstract

Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs.
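The abstract only names the ReLU Routing MoH attention and its two companion objectives at a high level, without formulas. Below is a minimal, hypothetical sketch of how such a ReLU-routed Mixture-of-Heads attention layer and the two regularizers could look in PyTorch; the class and function names (ReLURoutedMoHAttention, smooth_proximity_reg, head_activation_reg) and all formulations are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLURoutedMoHAttention(nn.Module):
    """Sketch of a ReLU-routed Mixture-of-Heads (MoH) self-attention layer.

    Each token is routed to a dynamic subset of heads: a linear router
    produces one score per head, ReLU zeroes out inactive heads, and the
    surviving scores weight the corresponding head outputs.
    (Hypothetical layout; the paper's exact design may differ.)
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)  # per-token, per-head routing scores
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v              # (B, heads, N, head_dim)

        gates = F.relu(self.router(x))              # (B, N, heads); zero = head inactive
        out = out.permute(0, 2, 1, 3) * gates.unsqueeze(-1)
        out = out.reshape(B, N, C)
        return self.proj(out), gates


def smooth_proximity_reg(gates: torch.Tensor, step: int, total_steps: int,
                         tau: float = 5.0) -> torch.Tensor:
    """Toy Smooth Proximity Regularization: an exponentially decaying
    penalty (exponential distance scaling) that constrains routing early
    in training and relaxes it as training progresses."""
    scale = torch.exp(torch.tensor(-tau * step / total_steps))
    return scale * gates.pow(2).mean()


def head_activation_reg(gates: torch.Tensor) -> torch.Tensor:
    """Toy Head Activation Enhancement: encourage balanced attention routing
    by penalizing deviation of per-head activation rates from their mean."""
    usage = (gates > 0).float().mean(dim=(0, 1))    # fraction of tokens using each head
    return ((usage - usage.mean()) ** 2).mean()
```

In this reading, the ReLU gate lets the number of active heads per token vary freely (unlike fixed top-k routing), while the two auxiliary terms keep routing stable early on and prevent a few heads from dominating; how closely this matches PVChat's actual objectives is an assumption.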

