TTRV: Test-Time Reinforcement Learning for Vision Language Models

Abstract

Existing methods for extracting reward signals in reinforcement learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV, which enhances vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's outputs, obtained by running inference on each test sample multiple times. We further propose to control the diversity of the model's output by simultaneously rewarding the model for low entropy of the empirical output distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
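
The two reward signals described above (output frequency and low entropy of the empirical output distribution) can be illustrated with a minimal Python sketch. The abstract does not give the exact reward formula, so everything here is an assumption made for illustration: the function names `ttrv_style_rewards` and `group_relative_advantages`, the additive combination of the two terms, the weight `alpha`, and the normalization of entropy by log(N). The group-relative normalization follows the general idea behind GRPO (reward minus group mean, divided by group standard deviation) and is not taken from the authors' implementation.

```python
import math
from collections import Counter


def ttrv_style_rewards(answers, alpha=0.75):
    """Label-free rewards for N sampled answers to one test input.

    Assumed form (not from the source): each answer is rewarded with its
    empirical frequency among the samples, plus a shared bonus that grows
    as the entropy of the empirical answer distribution shrinks.
    `alpha` is a hypothetical weight for the entropy term.
    """
    n = len(answers)
    if n == 0:
        return []
    counts = Counter(answers)
    freqs = {a: c / n for a, c in counts.items()}

    # Shannon entropy of the empirical distribution over distinct answers.
    entropy = -sum(p * math.log(p) for p in freqs.values())
    # log(n) is the largest entropy n samples can produce (all distinct).
    max_entropy = math.log(n) if n > 1 else 1.0
    entropy_bonus = 1.0 - entropy / max_entropy

    return [freqs[a] + alpha * entropy_bonus for a in answers]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within the sampled group, in the spirit of GRPO."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers for a single unlabeled test image.
    sampled = ["cat", "cat", "cat", "dog"]
    rewards = ttrv_style_rewards(sampled)
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch the majority answer receives a higher reward than the minority one, and the shared entropy bonus is largest when all samples agree. No ground-truth label is used at any point, which is the property the abstract emphasizes.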
