

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

July 9, 2025
Authors: Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
cs.AI

Abstract

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy while using only 3.6% of the training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
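The sparse-to-dense TTS strategy described above can be sketched as a simple control loop: sample several answers from a sparse set of frames, and only densify the frame set when the answers disagree. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation; `answer_fn`, the uniform frame subsampling, the doubling schedule, and the agreement threshold are all assumptions standing in for the actual video LLM and its consistency check.

```python
from collections import Counter

def sparse_to_dense_tts(answer_fn, total_frames, start_frames=8,
                        num_samples=4, agreement=0.75, max_frames=64):
    """Hedged sketch of sparse-to-dense test-time scaling.

    answer_fn(frame_indices) -> str is a stand-in for the video LLM:
    each call samples one answer conditioned on the selected frames.
    Starting from a sparse frame set, we draw several answers; if they
    agree strongly enough we stop, otherwise we double the frame count
    and try again (up to a frame budget).
    """
    n = start_frames
    while True:
        # Uniformly subsample n frame indices from the full video.
        idx = [round(i * (total_frames - 1) / max(n - 1, 1)) for i in range(n)]
        answers = [answer_fn(idx) for _ in range(num_samples)]
        top, count = Counter(answers).most_common(1)[0]
        # Stop on sufficient output consistency or when the budget is hit.
        if count / num_samples >= agreement or n >= min(max_frames, total_frames):
            return top
        n = min(n * 2, max_frames, total_frames)
```

With a model stub whose answers stabilize once enough frames are visible, the loop returns the sparse-frame majority answer immediately when it is consistent, and escalates to denser sampling otherwise.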