

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

July 9, 2025
Authors: Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
cs.AI

Abstract

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy while using only 3.6% of the training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
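The sparse-to-dense TTS strategy described above can be sketched as a simple control loop: sample several answers from a sparse set of frames, and only densify the frame set when the answers disagree. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation; `answer_fn`, the uniform frame subsampling, the doubling schedule, and the agreement threshold are all assumptions standing in for the actual video LLM and its consistency check.

```python
from collections import Counter

def sparse_to_dense_tts(answer_fn, total_frames, start_frames=8,
                        num_samples=4, agreement=0.75, max_frames=64):
    """Hedged sketch of sparse-to-dense test-time scaling.

    answer_fn(frame_indices) -> str is a stand-in for the video LLM:
    each call samples one answer conditioned on the selected frames.
    Starting from a sparse frame set, we draw several answers; if they
    agree strongly enough we stop, otherwise we double the frame count
    and try again (up to a frame budget).
    """
    n = start_frames
    while True:
        # Uniformly subsample n frame indices from the full video.
        idx = [round(i * (total_frames - 1) / max(n - 1, 1)) for i in range(n)]
        answers = [answer_fn(idx) for _ in range(num_samples)]
        top, count = Counter(answers).most_common(1)[0]
        # Stop on sufficient output consistency or when the budget is hit.
        if count / num_samples >= agreement or n >= min(max_frames, total_frames):
            return top
        n = min(n * 2, max_frames, total_frames)
```

With a model stub whose answers stabilize once enough frames are visible, the loop returns the sparse-frame majority answer immediately when it is consistent, and escalates to denser sampling otherwise.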