PEARL: Personalized Streaming Video Understanding Model

Abstract

Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
