TTRV: Test-Time Reinforcement Learning for Vision Language Models

Abstract

Existing methods for extracting reward signals in reinforcement learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV, which enhances vision-language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's outputs, obtained by running inference on each test sample multiple times. We further propose to control the diversity of the model's outputs by simultaneously rewarding the model for low entropy of the empirical output distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
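The abstract describes two label-free reward signals: the frequency of each sampled answer among repeated inferences on the same test sample, and a bonus for low entropy of the empirical answer distribution. The following is a minimal sketch of how such signals could be computed, not the authors' implementation. The function name `label_free_rewards`, the weight `alpha`, and the additive way the two terms are combined are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def label_free_rewards(answers, alpha=0.5):
    """Compute illustrative per-response rewards from sampled answers.

    answers: list of answer strings sampled for ONE test input.
    alpha:   hypothetical weight on the entropy penalty (an assumption).

    Returns (rewards, entropy), where rewards[i] corresponds to answers[i].
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    n = len(answers)
    counts = Counter(answers)

    # Empirical distribution over the distinct answers in this group.
    freqs = {answer: count / n for answer, count in counts.items()}

    # Shannon entropy (in nats) of the empirical distribution.
    entropy = -sum(p * math.log(p) for p in freqs.values())

    # Frequency reward per response, minus a shared entropy penalty.
    # Combining the terms additively is an assumption of this sketch.
    rewards = [freqs[answer] - alpha * entropy for answer in answers]
    return rewards, entropy


if __name__ == "__main__":
    sampled = ["cat", "cat", "cat", "dog"]
    rewards, entropy = label_free_rewards(sampled)
    for answer, reward in zip(sampled, rewards):
        print(f"{answer}: {reward:.3f}")
    print(f"entropy: {entropy:.3f}")
```

In this sketch the majority answer receives the higher reward, and a group whose samples all agree incurs no entropy penalty. In a GRPO-style setup, rewards of this kind would then be compared within the group of samples drawn for the same input.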

