Online Generic Event Boundary Detection

Abstract

Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods must process all frames of a video before making predictions, unlike humans, who process data online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect the boundaries of generic events immediately in streaming videos. The task poses a unique challenge: identifying subtle, taxonomy-free event changes in real time, without access to future frames. To address it, we propose Estimator, a novel On-GEBD framework inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame that reflects the current event dynamics, based solely on prior frames. The OBD then measures the prediction error and adaptively adjusts its threshold using statistical tests on past errors, so that it can capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline GEBD methods on the Kinetics-GEBD and TAPOS datasets.
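
The error-thresholding idea behind the OBD can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify which statistical test is used, so the example assumes a simple z-score test over a sliding window of past prediction errors. The names `OnlineBoundaryDetector`, `prediction_error`, and `detect_boundaries`, as well as the parameters `window`, `z_threshold`, and `min_history`, are hypothetical. The per-frame errors are assumed to come from some predictor playing the role of the CEA, which is not modeled here.

```python
from collections import deque
from statistics import mean, stdev


def prediction_error(predicted, actual):
    """Mean squared error between two equal-length feature vectors (assumed metric)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


class OnlineBoundaryDetector:
    """Flags a boundary when the current error is an outlier relative to
    recent errors. Illustrative z-score test, not the paper's actual test."""

    def __init__(self, window=30, z_threshold=3.0, min_history=5):
        self.errors = deque(maxlen=window)       # sliding window of past errors
        self.z_threshold = z_threshold
        self.min_history = max(2, min_history)   # stdev needs at least 2 samples

    def update(self, error):
        """Consume one prediction error; return True if it marks a boundary."""
        is_boundary = False
        if len(self.errors) >= self.min_history:
            mu = mean(self.errors)
            sigma = max(stdev(self.errors), 1e-8)  # floor avoids a zero-width band
            # The threshold adapts to the recent error distribution.
            is_boundary = error > mu + self.z_threshold * sigma
        if is_boundary:
            # A new event starts: drop the statistics of the previous event.
            self.errors.clear()
        else:
            self.errors.append(error)
        return is_boundary


def detect_boundaries(errors, **kwargs):
    """Run the detector over a stream of errors and return boundary indices."""
    detector = OnlineBoundaryDetector(**kwargs)
    return [t for t, e in enumerate(errors) if detector.update(e)]


if __name__ == "__main__":
    stream = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.95,
              0.10, 0.11, 0.09, 0.10, 0.12, 0.10]
    print(detect_boundaries(stream))
```

Each call to `update` uses only errors observed so far, which mirrors the online constraint of having no access to future frames. Clearing the window after a detection is one possible design choice: it keeps the statistics of the previous event from influencing the threshold for the next one, at the cost of a short warm-up period of `min_history` frames during which no boundary can be flagged.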