PEARL: Personalized Streaming Video Understanding Model
March 20, 2026
Authors: Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang
cs.AI
Abstract
Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It measures a model's ability to respond to personalized concepts at exact timestamps in two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) Video-level, a novel mode focusing on personalized actions that unfold across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps; concept diversity and annotation quality are ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models show that PEARL achieves state-of-the-art performance. Notably, it yields consistent PSVU improvements when applied to 3 distinct architectures, demonstrating that it is a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
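For concreteness, below is a minimal sketch of the streaming evaluation constraint the abstract describes: the model observes frames in order and must answer each personalized query at its exact timestamp, without access to future frames. The field names (`video_id`, `timestamp`, `mode`, `concept`) and the `observe`/`answer` interface are illustrative assumptions, not the actual PEARL-Bench format, which is defined in the linked repository.

```python
from dataclasses import dataclass

@dataclass
class PSVUAnnotation:
    """One timestamped, personalized query over a streaming video (hypothetical schema)."""
    video_id: str
    timestamp: float  # second at which the model must answer
    mode: str         # "frame" (person/object in a frame) or "video" (action across frames)
    concept: str      # user-defined concept name, e.g. a personalized identity token
    question: str
    answer: str

def evaluate_streaming(model, frames, annotations):
    """Feed frames in order and query the model exactly when each timestamp is reached.

    Unlike offline video QA, the model may only condition on frames seen so far,
    mirroring the streaming constraint of the PSVU task.
    """
    pending = sorted(annotations, key=lambda a: a.timestamp)
    results = []
    for t, frame in frames:        # frames: iterable of (timestamp, image) pairs
        model.observe(frame, t)    # assumed interface: update the model's running memory
        while pending and pending[0].timestamp <= t:
            ann = pending.pop(0)
            pred = model.answer(ann.question)  # answer using only past frames
            results.append((ann, pred))
    return results
```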