video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

Abstract

Recent advances in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), but existing efforts to improve reasoning have been limited to solving mathematical problems and handling visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as stand-up comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements over the supervised fine-tuning model on RivaBench. Enhanced reasoning also gives video-SALMONN-o1 zero-shot synthetic video detection capabilities.