
PEARL: Personalized Streaming Video Understanding Model

March 20, 2026
Authors: Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang
cs.AI

Abstract

Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
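The two evaluation modes (frame-level vs. video-level) can be pictured with a minimal, hypothetical annotation schema. This is only an illustrative sketch: the field names (`start_s`, `end_s`, `mode`) and the `<my_dog>` concept tokens are assumptions for exposition, not the benchmark's released format.

```python
from dataclasses import dataclass

@dataclass
class PSVUAnnotation:
    """One timestamped personalized-concept query (hypothetical schema)."""
    video_id: str
    concept: str          # personalized concept, e.g. a user-named object
    mode: str             # "frame" (discrete frame) or "video" (action span)
    start_s: float        # time at which the query becomes answerable
    end_s: float          # end of the relevant span (== start_s for frame-level)
    question: str
    answer: str

def annotations_active_at(annotations, t):
    """Return annotations whose span covers streaming time t, i.e. the
    queries a streaming model must answer at that exact instant."""
    return [a for a in annotations if a.start_s <= t <= a.end_s]

anns = [
    PSVUAnnotation("vid_001", "<my_dog>", "frame", 12.0, 12.0,
                   "Is <my_dog> visible now?", "yes"),
    PSVUAnnotation("vid_001", "<my_dog> fetching", "video", 10.0, 18.0,
                   "What is <my_dog> doing?", "fetching a ball"),
]

# At t = 12.0 both the frame-level query and the ongoing action span apply.
print([a.concept for a in annotations_active_at(anns, 12.0)])
```

Frame-level entries collapse to a single timestamp, while video-level entries cover a continuous interval, so a single interval-containment check serves both modes.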