Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
May 5, 2025
Authors: Sungheon Jeong, Jihong Park, Mohsen Imani
cs.AI
Abstract
Most existing video anomaly detectors rely solely on RGB frames, which lack
the temporal resolution needed to capture abrupt or transient motion cues, key
indicators of anomalous events. To address this limitation, we propose
Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that
synthesizes event representations directly from RGB videos and fuses them with
image features through a principled, uncertainty-aware process. The system (i)
models heavy-tailed sensor noise with a Student's t likelihood, deriving
value-level inverse-variance weights via a Laplace approximation; (ii) applies
Kalman-style frame-wise updates to balance modalities over time; and (iii)
iteratively refines the fused latent state to erase residual cross-modal noise.
Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new
state of the art across multiple real-world anomaly detection benchmarks. These
findings highlight the utility of synthetic event representations in
emphasizing motion cues that are often underrepresented in RGB frames, enabling
accurate and robust video understanding across diverse applications without
requiring dedicated event sensors. Code and models are available at
https://github.com/EavnJeong/IEF-VAD.
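
The three fusion steps the abstract describes can be illustrated with a short sketch. The snippet below is a minimal, illustrative reconstruction, not the authors' released implementation (see the repository above): it shows value-level inverse-variance weights derived from a Student's t noise model, a precision-weighted (Kalman-style) combination of the two modalities with frame-wise temporal updates, and an IRLS-style loop standing in for the iterative refinement step. All names (`student_t_precision`, `fuse_frame`, `fuse_sequence`) and constants (`nu`, `blend`) are assumptions made for illustration.

```python
# Minimal sketch of uncertainty-weighted image-event fusion in the spirit of
# IEF-VAD. NOT the authors' code (see the repository above); function names,
# the fixed nu, and the blend factor are illustrative assumptions.
import numpy as np

def student_t_precision(residual, sigma, nu=4.0):
    """Value-level inverse-variance weight under a Student's t noise model.

    The weight (nu + 1) / (nu + (r / sigma)^2) is the standard robust
    reweighting for a t likelihood (it also arises from a local Gaussian
    approximation around the current estimate): large residuals, i.e.
    heavy-tailed noise, receive low precision.
    """
    return (nu + 1.0) / (nu + (residual / sigma) ** 2) / sigma ** 2

def fuse_frame(z_img, z_evt, sigma_img=1.0, sigma_evt=1.0, n_iter=3, nu=4.0):
    """Fuse one frame's image and event features by precision weighting,
    with an IRLS-style loop standing in for iterative refinement."""
    z = 0.5 * (z_img + z_evt)  # crude initial fused state
    for _ in range(n_iter):
        w_img = student_t_precision(z - z_img, sigma_img, nu)
        w_evt = student_t_precision(z - z_evt, sigma_evt, nu)
        z = (w_img * z_img + w_evt * z_evt) / (w_img + w_evt)
    return z

def fuse_sequence(Z_img, Z_evt, blend=0.8):
    """Kalman-style frame-wise updates: each fused observation is blended
    with the previous state, a lightweight stand-in for a full
    predict/update cycle."""
    prev, states = None, []
    for z_i, z_e in zip(Z_img, Z_evt):
        z = fuse_frame(z_i, z_e)
        if prev is not None:
            z = blend * z + (1.0 - blend) * prev  # temporal smoothing
        states.append(z)
        prev = z
    return np.stack(states)

# Example: T frames of D-dimensional features per modality.
T, D = 16, 8
rng = np.random.default_rng(0)
fused = fuse_sequence(rng.normal(size=(T, D)), rng.normal(size=(T, D)))
print(fused.shape)  # (16, 8)
```

Note the design choice in `fuse_frame`: because the Student's t weights depend on the residuals to the current fused state, re-estimating them for a few iterations naturally suppresses values where one modality disagrees sharply with the consensus, which is one plausible reading of "erasing residual cross-modal noise."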