Online Generic Event Boundary Detection

Abstract

Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods must process the complete set of video frames before making predictions, whereas humans process information online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect the boundaries of generic events immediately in streaming videos. This task poses the unique challenge of identifying subtle, taxonomy-free event changes in real time without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). The CEA generates a prediction of the future frame that reflects current event dynamics, based solely on prior frames. The OBD then measures the prediction error and adaptively adjusts its threshold using statistical tests on past errors, so that it can capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline GEBD methods on the Kinetics-GEBD and TAPOS datasets.
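
The idea of flagging a boundary when the current prediction error is statistically unusual relative to past errors can be illustrated with a minimal sketch. It is not the Estimator implementation: the abstract does not specify which statistical test the OBD uses, so a simple z-score over a sliding window of past errors is assumed here. The function name `detect_boundaries` and the parameters `window`, `z_thresh`, and `min_history` are hypothetical, and the per-frame prediction errors are assumed to come from some upstream anticipator.

```python
from collections import deque
from statistics import mean, stdev
from typing import Deque, Iterable, List


def detect_boundaries(
    errors: Iterable[float],
    window: int = 8,         # hypothetical: number of past errors kept
    z_thresh: float = 3.0,   # hypothetical: z-score needed to flag a boundary
    min_history: int = 4,    # hypothetical: errors required before testing
) -> List[int]:
    """Return indices whose prediction error is an outlier w.r.t. past errors.

    Assumption: a z-score test stands in for the unspecified statistical
    test described in the abstract. Only past errors are used, so the
    decision at step t never looks at future frames.
    """
    history: Deque[float] = deque(maxlen=window)
    boundaries: List[int] = []

    for t, err in enumerate(errors):
        if len(history) >= min_history:
            mu = mean(history)
            # Guard against zero variance in a perfectly flat history.
            sigma = max(stdev(history), 1e-6)
            z = (err - mu) / sigma
            if z > z_thresh:
                boundaries.append(t)
                # Start fresh statistics for the new event; the outlier
                # itself is not added to the history.
                history.clear()
                continue
        history.append(err)

    return boundaries


if __name__ == "__main__":
    stream = [0.10, 0.11, 0.09, 0.10, 0.12, 0.90,
              0.50, 0.52, 0.49, 0.51, 0.50, 0.51]
    print(detect_boundaries(stream))
```

On the toy stream above, only index 5 is flagged: the jump to 0.90 is far outside the spread of the preceding errors, while the later errors around 0.50 are judged against their own recent history after the reset. Because the threshold is relative to the running mean and spread rather than a fixed value, the same code adapts to events with different baseline error levels.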