

VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning

May 29, 2025
Authors: Liyun Zhu, Qixiang Chen, Xi Shen, Xiaodong Cun
cs.AI

Abstract

Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.
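The abstract reports that reinforcement fine-tuning improves both question-answering accuracy and temporal grounding. As a rough illustration only (the paper's actual reward design is not given here; the function names, interval format, and equal weighting below are assumptions), a composite reward combining multiple-choice correctness with interval overlap could be sketched as:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two [start, end] intervals (in seconds)."""
    start = max(pred[0], gt[0])
    end = min(pred[1], gt[1])
    inter = max(0.0, end - start)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def anomaly_reward(pred_choice, gt_choice, pred_span, gt_span,
                   w_qa=0.5, w_tmp=0.5):
    """Composite reward: correct multiple-choice answer plus temporal grounding.

    pred_choice / gt_choice: answer letters, e.g. "B".
    pred_span / gt_span: predicted and annotated anomaly intervals.
    """
    qa = 1.0 if pred_choice == gt_choice else 0.0
    return w_qa * qa + w_tmp * temporal_iou(pred_span, gt_span)
```

Such a scalar reward is the kind of signal an RFT loop (e.g. a policy-gradient-style trainer) would optimize per sampled model response; the 0.5/0.5 weighting is purely illustrative.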