VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
May 29, 2025
Authors: Liyun Zhu, Qixiang Chen, Xi Shen, Xiaodong Cun
cs.AI
Abstract
Video Anomaly Understanding (VAU) is essential for applications such as smart
cities, security surveillance, and disaster alert systems, yet remains
challenging due to its demand for fine-grained spatio-temporal perception and
robust reasoning under ambiguity. Despite advances in anomaly detection,
existing methods often lack interpretability and struggle to capture the causal
and contextual aspects of abnormal events. This limitation is further
compounded by the absence of comprehensive benchmarks for evaluating reasoning
ability in anomaly scenarios. To address both challenges, we introduce VAU-R1,
a data-efficient framework built upon Multimodal Large Language Models (MLLMs),
which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT).
In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored
for video anomaly reasoning, featuring multiple-choice QA, detailed rationales,
temporal annotations, and descriptive captions. Empirical results show that
VAU-R1 significantly improves question answering accuracy, temporal grounding,
and reasoning coherence across diverse contexts. Together, our method and
benchmark establish a strong foundation for interpretable and reasoning-aware
video anomaly understanding. Our code is available at
https://github.com/GVCLab/VAU-R1.