VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
May 29, 2025
Authors: Liyun Zhu, Qixiang Chen, Xi Shen, Xiaodong Cun
cs.AI
Abstract
Video Anomaly Understanding (VAU) is essential for applications such as smart
cities, security surveillance, and disaster alert systems, yet remains
challenging due to its demand for fine-grained spatio-temporal perception and
robust reasoning under ambiguity. Despite advances in anomaly detection,
existing methods often lack interpretability and struggle to capture the causal
and contextual aspects of abnormal events. This limitation is further
compounded by the absence of comprehensive benchmarks for evaluating reasoning
ability in anomaly scenarios. To address both challenges, we introduce VAU-R1,
a data-efficient framework built upon Multimodal Large Language Models (MLLMs),
which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT).
In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored
for video anomaly reasoning, featuring multiple-choice QA, detailed rationales,
temporal annotations, and descriptive captions. Empirical results show that
VAU-R1 significantly improves question answering accuracy, temporal grounding,
and reasoning coherence across diverse contexts. Together, our method and
benchmark establish a strong foundation for interpretable and reasoning-aware
video anomaly understanding. Our code is available at
https://github.com/GVCLab/VAU-R1.