video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
February 17, 2025
Authors: Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun Ma, Chao Zhang
cs.AI
Abstract
While recent advancements in reasoning optimization have significantly
enhanced the capabilities of large language models (LLMs), existing efforts to
improve reasoning have been limited to solving mathematical problems and
focusing on visual graphical inputs, neglecting broader applications in general
video understanding. This paper proposes video-SALMONN-o1, the first open-source
reasoning-enhanced audio-visual LLM designed for general video understanding
tasks. To enhance its reasoning abilities, we develop a reasoning-intensive
dataset featuring challenging audio-visual questions with step-by-step
solutions. We also propose process direct preference optimization (pDPO), which
leverages contrastive step selection to achieve efficient step-level reward
modelling tailored for multimodal inputs. Additionally, we introduce RivaBench,
the first reasoning-intensive video understanding benchmark, featuring over
4,000 high-quality, expert-curated question-answer pairs across scenarios such
as standup comedy, academic presentations, and synthetic video detection.
video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision
baseline across different video reasoning benchmarks. In addition, pDPO achieves
6-8% improvements compared to the supervised fine-tuning model on RivaBench.
Enhanced reasoning also gives video-SALMONN-o1 zero-shot synthetic video detection
capabilities.