

TTRV: Test-Time Reinforcement Learning for Vision Language Models

October 8, 2025
作者: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza
cs.AI

Abstract

Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
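To make the label-free reward concrete, the sketch below illustrates one plausible way the frequency-based reward and the low-entropy bonus described in the abstract could be computed over a group of sampled answers for a single unlabeled test example. This is not the authors' implementation: the function name, the `alpha` weight, and the answer-extraction step are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a frequency-based reward
# plus a low-entropy bonus over a GRPO-style group of sampled answers.
from collections import Counter
import math

def frequency_rewards(sampled_answers, alpha=0.1):
    """Reward each sampled answer by its relative frequency in the group,
    plus a shared bonus for low entropy of the empirical answer distribution."""
    n = len(sampled_answers)
    counts = Counter(sampled_answers)

    # Empirical distribution over distinct answers in the group.
    probs = {ans: c / n for ans, c in counts.items()}

    # Entropy of the empirical distribution; lower entropy means more agreement.
    entropy = -sum(p * math.log(p) for p in probs.values())
    entropy_bonus = -alpha * entropy  # reward low entropy

    # Per-sample reward: frequency of that sample's answer plus the shared bonus.
    return [probs[ans] + entropy_bonus for ans in sampled_answers]

# Example: 8 rollouts of the VLM on one unlabeled test image/question.
answers = ["cat", "cat", "dog", "cat", "cat", "bird", "cat", "dog"]
rewards = frequency_rewards(answers)
# In a GRPO-style update, these rewards would be normalized within the group
# and used as advantages to adapt the model at test time, without any labels.
```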