PVChat: Personalized Video Chat with One-Shot Learning
March 21, 2025
Authors: Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yuchen Li, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo
cs.AI
Abstract
Video large language models (ViLLMs) excel in general video understanding,
e.g., recognizing activities like talking and eating, but struggle with
identity-aware comprehension, such as "Wilson is receiving chemotherapy" or
"Tom is discussing with Sarah", limiting their applicability in smart
healthcare and smart home environments. To address this limitation, we propose
a one-shot learning framework PVChat, the first personalized ViLLM that enables
subject-aware question answering (QA) from a single video for each subject. Our
approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically
augmented video-QA dataset, leveraging a progressive image-to-video learning
strategy. Specifically, we introduce an automated augmentation pipeline that
synthesizes identity-preserving positive samples and retrieves hard negatives
from existing video corpora, generating a diverse training dataset with four QA
types: existence, appearance, action, and location inquiries. To enhance
subject-specific learning, we propose a ReLU Routing MoH attention mechanism,
alongside two novel objectives: (1) Smooth Proximity Regularization for
progressive learning through exponential distance scaling and (2) Head
Activation Enhancement for balanced attention routing. Finally, we adopt a
two-stage training strategy, transitioning from image pre-training to video
fine-tuning, enabling a gradual learning process from static attributes to
dynamic representations. We evaluate PVChat on diverse datasets covering
medical scenarios, TV series, anime, and real-world footage, demonstrating its
superiority in personalized feature understanding after learning from a single
video, compared to state-of-the-art ViLLMs.
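
To make the data side of the abstract concrete, the sketch below shows what one synthetically augmented training record covering the four question types (existence, appearance, action, location) could look like. The field names, the `<subject>` identity placeholder, the clip path, and the sample answers are illustrative assumptions, not the paper's released schema.

```python
# Hypothetical layout of one video-QA training record produced by the
# augmentation pipeline described in the abstract. All names and values
# below are illustrative only.
qa_record = {
    "subject_token": "<subject>",            # personalized identity placeholder
    "video_path": "clips/example_0001.mp4",  # hypothetical clip path
    "is_positive": True,                     # False for retrieved hard negatives
    "qa_pairs": [
        {"type": "existence",  "question": "Does <subject> appear in this video?",
         "answer": "Yes, <subject> appears throughout the clip."},
        {"type": "appearance", "question": "What is <subject> wearing?",
         "answer": "<subject> is wearing a blue jacket."},
        {"type": "action",     "question": "What is <subject> doing?",
         "answer": "<subject> is talking with another person."},
        {"type": "location",   "question": "Where is <subject> in the scene?",
         "answer": "<subject> is seated on the left side of the frame."},
    ],
}
```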
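For a concrete picture of the architectural pieces named in the abstract, the following is a minimal PyTorch sketch of a ReLU-routed Mixture-of-Heads attention layer together with plausible forms of the two auxiliary objectives. The abstract gives no formulas, so the router design, the exponential form of the Smooth Proximity Regularization term, and the variance-based Head Activation Enhancement term below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLURoutedMoHAttention(nn.Module):
    """Sketch of multi-head self-attention whose heads are gated per token
    by a ReLU router, so each token activates only a sparse subset of heads."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)  # one gating logit per head
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                    # (B, H, N, d)

        # ReLU routing: heads with non-positive logits are switched off.
        gates = F.relu(self.router(x))                    # (B, N, H)
        out = out * gates.permute(0, 2, 1).unsqueeze(-1)  # gate each head
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out), gates


def auxiliary_losses(gates: torch.Tensor, ref_gates: torch.Tensor):
    """Hypothetical forms of the two objectives named in the abstract."""
    # Smooth Proximity Regularization (assumed form): keep the routing for a
    # new subject close to a reference pattern, mapping the squared distance
    # through an exponential so the penalty grows smoothly and stays bounded.
    dist = (gates - ref_gates).pow(2).mean()
    spr = 1.0 - torch.exp(-dist)

    # Head Activation Enhancement (assumed form): encourage balanced routing
    # by penalising the variance of per-head activation frequencies.
    head_usage = (gates > 0).float().mean(dim=(0, 1))     # (H,)
    hae = head_usage.var()
    return spr, hae
```

In a full training setup these terms would be weighted and added to the QA loss during the two-stage image-to-video schedule; the weights and schedule are not specified in the abstract.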