Unified Reinforcement and Imitation Learning for Vision-Language Models
October 22, 2025
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
cs.AI
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress, yet their
large scale often renders them impractical for resource-constrained
environments. This paper introduces Unified Reinforcement and Imitation
Learning (RIL), a novel and efficient training algorithm designed to create
powerful, lightweight VLMs. RIL distinctively combines the strengths of
reinforcement learning and adversarial imitation learning. This enables
smaller student VLMs not only to mimic the sophisticated text generation of
large teacher models but also to systematically improve their generative
capabilities through reinforcement signals. Key to our imitation framework is
an LLM-based discriminator that adeptly distinguishes between student and
teacher outputs, complemented by guidance from multiple large teacher VLMs to
ensure diverse learning. This unified learning strategy, leveraging both
reinforcement and imitation, empowers student models to achieve significant
performance gains, making them competitive with leading closed-source VLMs.
Extensive experiments on diverse vision-language benchmarks demonstrate that
RIL significantly narrows the performance gap with state-of-the-art open- and
closed-source VLMs and, in several instances, surpasses them.
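The abstract describes a unified objective: the student is rewarded both for fooling a discriminator that separates student from teacher outputs (the adversarial imitation term) and for task success (the reinforcement term). The following is a minimal toy sketch of that idea, not the paper's actual implementation: the function names, the fixed discriminator scores, the weighting coefficient `beta`, and the Bernoulli toy policy are all illustrative assumptions.

```python
import math
import random

random.seed(0)

# Toy student "policy": a single logit over producing a teacher-like output.
# In RIL this would be a small student VLM; here it is one scalar parameter.
theta = 0.0

def student_prob(theta: float) -> float:
    """Probability that the student produces a teacher-like output."""
    return 1.0 / (1.0 + math.exp(-theta))

def unified_reward(d_score: float, task_reward: float, beta: float = 0.5) -> float:
    """Hypothetical unified signal: adversarial imitation reward log D(output)
    (high when the discriminator believes the output came from the teacher),
    blended with a reinforcement (task) reward."""
    imitation = math.log(max(d_score, 1e-8))
    return beta * imitation + (1.0 - beta) * task_reward

for step in range(200):
    p = student_prob(theta)
    teacher_like = random.random() < p           # sample one output
    d_score = 0.9 if teacher_like else 0.1       # frozen stand-in discriminator
    task_reward = 1.0 if teacher_like else 0.0   # stand-in verifiable reward
    r = unified_reward(d_score, task_reward)
    # REINFORCE-style gradient for a Bernoulli policy: (action - p) * reward.
    a = 1.0 if teacher_like else 0.0
    theta += 0.1 * (a - p) * r

print(f"final P(teacher-like) = {student_prob(theta):.2f}")
```

Under this toy reward, outputs the discriminator judges teacher-like earn a higher unified reward, so the REINFORCE update pushes the student toward teacher-like generations; in the paper this role is played by an LLM-based discriminator scoring full text outputs, with multiple teacher VLMs supplying diverse targets.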