PEARL: Personalized Streaming Video Understanding Model

Abstract

Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This separates continuous visual input from instant real-world feedback and limits their ability to provide the real-time, interactive, personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) frame-level, which focuses on a specific person or object in discrete frames, and (2) a novel video-level mode, which focuses on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a pipeline that combines automated generation with human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, showing that it is a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
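
To make the "respond at exact timestamps" setting concrete, the following is a minimal sketch of how a timestamped query could restrict a streaming model to frames it has already seen. It is not taken from the PEARL code base: the Query record, its field names, and the visible_frames helper are hypothetical, and the rule that only frames at or before the query timestamp are visible is an assumption about what "streaming" implies here.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class Query:
    """Hypothetical annotation record; field names are illustrative only."""
    concept: str                     # user-defined concept name, e.g. "<my_dog>"
    mode: Literal["frame", "video"]  # frame-level or video-level question
    timestamp: float                 # seconds at which the question is asked
    question: str


def visible_frames(frame_times: List[float], query: Query) -> List[float]:
    """Return the timestamps of frames usable when answering `query`.

    Assumes `frame_times` is sorted in ascending order. Only frames at or
    before the query timestamp are returned, so no future content leaks in.
    """
    cutoff = bisect_right(frame_times, query.timestamp)
    return frame_times[:cutoff]


frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]
q = Query(concept="<my_dog>", mode="frame", timestamp=1.2,
          question="Where is <my_dog> right now?")
print(visible_frames(frame_times, q))  # [0.0, 0.5, 1.0]
```

Under this assumption, a frame-level query would typically be answered from the latest visible frame, while a video-level query would draw on a window of the visible frames; how PEARL itself selects or stores frames is not specified in the abstract.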
