
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

March 31, 2025
Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
cs.AI

Abstract

Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
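The RL setup described above benefits from the benchmark's easily verifiable ground-truth answers: each multiple-choice question admits a simple rule-based reward, the kind of signal GRPO-style post-training consumes. The Python sketch below illustrates such a reward; the verifiable_reward function and the "Answer: X" output format are illustrative assumptions, not taken from the paper's implementation.

    import re

    def verifiable_reward(model_output: str, ground_truth: str) -> float:
        """Rule-based reward for a multiple-choice answer.

        Returns 1.0 if the option letter extracted from the model's output
        matches the ground-truth letter, else 0.0. The "Answer: X" line
        format is an assumption for illustration; the paper only states
        that the ground-truth answers are easily verifiable.
        """
        match = re.search(r"Answer:\s*\(?([A-D])\)?", model_output, re.IGNORECASE)
        if match is None:
            return 0.0  # unparseable outputs earn no reward
        return 1.0 if match.group(1).upper() == ground_truth.upper() else 0.0

    # Example: scalar rewards for candidate completions of one question,
    # as an RL algorithm would use them to compute advantages.
    completions = [
        "The person picks up the pan first. Answer: (B)",
        "Answer: C",
    ]
    print([verifiable_reward(c, "B") for c in completions])  # [1.0, 0.0]

Because the reward is a deterministic string match rather than a learned model, it is cheap and hard to game, which is consistent with the paper's observation that RL with verifiable answers is data-efficient; its downside, also noted in the abstract, is that nothing in the signal rewards logically coherent intermediate reasoning.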

