TTRV: Test-Time Reinforcement Learning for Vision Language Models

October 8, 2025
Authors: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza
cs.AI

Abstract

Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for producing a low-entropy empirical output distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
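The abstract describes the reward design only at a high level. As a rough, minimal sketch (not the paper's exact formulation), the idea of rewarding each sampled answer by its frequency within a group of rollouts for the same unlabeled test input, while penalizing the entropy of the group's empirical output distribution, could look like the following Python; the function name ttrv_rewards and the entropy_weight coefficient are assumptions for illustration:

```python
import math
from collections import Counter

def ttrv_rewards(sampled_answers, entropy_weight=0.1):
    """Sketch of a frequency-plus-entropy reward for one group of rollouts.

    Each answer sampled from the base model for the same test input is
    rewarded by how often it appears in the group (a majority-style
    pseudo-label), minus a penalty proportional to the entropy of the
    empirical answer distribution (favoring confident, low-diversity output).
    """
    n = len(sampled_answers)
    counts = Counter(sampled_answers)

    # Empirical distribution over distinct answers in the group.
    probs = {ans: c / n for ans, c in counts.items()}

    # Entropy of the empirical output distribution (lower is rewarded).
    entropy = -sum(p * math.log(p) for p in probs.values())

    # Per-sample reward: that answer's frequency minus the shared entropy term.
    return [probs[ans] - entropy_weight * entropy for ans in sampled_answers]

# Example: 5 rollouts for one unlabeled test image.
answers = ["cat", "cat", "dog", "cat", "fox"]
print(ttrv_rewards(answers))
```

In GRPO-style optimization, per-sample rewards of this kind would then be normalized within the group to form relative advantages for the policy update; the exact reward and objective used by TTRV are specified in the paper.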