Unified Reinforcement and Imitation Learning for Vision-Language Models

October 22, 2025
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
cs.AI

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
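The abstract describes RIL only at a high level, so the paper's actual training recipe is not reproduced here. As a rough illustration of the general mechanism it names, below is a minimal, self-contained PyTorch sketch of adversarial imitation combined with a reinforcement signal: a toy student generator is rewarded, REINFORCE-style, for producing sequences that a discriminator scores as teacher-like. All module names, model sizes, and the synthetic "teacher" data are illustrative assumptions, not the authors' implementation (in the paper, the student and teachers are VLMs and the discriminator is LLM-based).

```python
# Sketch of adversarial imitation + reinforcement: a discriminator is trained
# to separate teacher sequences from student sequences, and its score is used
# as the student's reward (a GAIL-style setup). Toy stand-ins throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, SEQ_LEN, BATCH = 32, 64, 12, 16

class TinyPolicy(nn.Module):
    """Toy autoregressive student: a GRU over token embeddings."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.gru = nn.GRU(HID, HID, batch_first=True)
        self.head = nn.Linear(HID, VOCAB)

    def sample(self, batch):
        tok = torch.zeros(batch, 1, dtype=torch.long)  # BOS token = 0
        h, toks, logps = None, [], []
        for _ in range(SEQ_LEN):
            out, h = self.gru(self.emb(tok), h)
            dist = torch.distributions.Categorical(logits=self.head(out[:, -1]))
            nxt = dist.sample()                    # (B,)
            logps.append(dist.log_prob(nxt))       # keep log-probs for REINFORCE
            tok = nxt.unsqueeze(1)
            toks.append(tok)
        return torch.cat(toks, 1), torch.stack(logps, 1)  # (B, T), (B, T)

class TinyDiscriminator(nn.Module):
    """Stand-in for the LLM-based discriminator: scores a sequence as teacher-like."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.cls = nn.Linear(HID, 1)

    def forward(self, toks):
        return self.cls(self.emb(toks).mean(1)).squeeze(-1)  # logits: (B,)

student, disc = TinyPolicy(), TinyDiscriminator()
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

for step in range(200):
    # Synthetic "teacher" outputs standing in for responses from large teacher VLMs.
    teacher_toks = torch.randint(1, VOCAB // 2, (BATCH, SEQ_LEN))
    student_toks, logps = student.sample(BATCH)

    # 1) Discriminator update: teacher = real (label 1), student = fake (label 0).
    d_loss = (
        F.binary_cross_entropy_with_logits(disc(teacher_toks), torch.ones(BATCH))
        + F.binary_cross_entropy_with_logits(disc(student_toks), torch.zeros(BATCH))
    )
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Student update: the reinforcement signal is how teacher-like the
    #    discriminator finds the sampled sequence.
    with torch.no_grad():
        reward = torch.sigmoid(disc(student_toks))  # (B,) in [0, 1]
        reward = reward - reward.mean()             # simple mean baseline
    pg_loss = -(logps.sum(1) * reward).mean()       # REINFORCE objective
    opt_s.zero_grad()
    pg_loss.backward()
    opt_s.step()
```

The design point the sketch isolates is the one the abstract emphasizes: because the reward comes from a learned discriminator rather than from token-level matching, the student is optimized via reinforcement on its own sampled outputs while still being pulled toward the teachers' output distribution. Guidance from multiple teachers would simply diversify the "real" side of the discriminator's training data.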