Video-RTS: 効率的かつ高度なビデオ推論のための強化学習とテスト時スケーリングの再考

要旨

大規模言語モデル（LLM）を用いた強化学習（RL）ベースのビデオ推論が進展しているにもかかわらず、データ収集とファインチューニングは依然として大きな課題です。これらの手法は、大規模な教師ありファインチューニング（SFT）と膨大なビデオデータ、長い連鎖思考（CoT）アノテーションに依存することが多く、コストがかかり、スケーリングが困難です。この問題に対処するため、我々はVideo-RTSを提案します。これは、データ効率の高いRLとビデオ適応型テストタイムスケーリング（TTS）戦略を組み合わせることで、ビデオ推論能力を大幅に向上させる新しいアプローチです。RLサンプルのデータスケーリングに関する観察に基づき、リソース集約的なSFTステップをスキップし、追加のアノテーションや大規模なファインチューニングを必要としない、出力ベースの報酬を用いた効率的な純粋RLトレーニングを採用します。さらに、計算リソースをより効率的に活用するため、出力の一貫性に基づいてフレームを反復的に追加するスパースからデンスへのビデオTTS戦略を導入し、推論を改善します。我々のアプローチを複数のビデオ推論ベンチマークで検証し、Video-RTSが既存のビデオ推論モデルを平均2.4%の精度で上回り、トレーニングサンプルのわずか3.6%しか使用しないことを示しました。例えば、Video-RTSは、最近の挑戦的なビデオ推論ベンチマークであるVideo-Holmesで4.2%、MMVUで2.6%の改善を達成しました。特に、我々の純粋RLトレーニングと適応型ビデオTTSは相補的な強みを提供し、Video-RTSの強力な推論性能を可能にしています。

English

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.

Video-RTS: 効率的かつ高度なビデオ推論のための強化学習とテスト時スケーリングの再考

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

要旨

Support