

Unified Reinforcement and Imitation Learning for Vision-Language Models

October 22, 2025
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
cs.AI

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
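
The abstract describes RIL only at a high level: an LLM-based discriminator scores how teacher-like a student response is, and that adversarial imitation signal is combined with reinforcement signals in a unified update. As a rough illustration of one way such a blended objective could look, here is a minimal PyTorch-style sketch. Everything below is a hypothetical reading, not the paper's implementation: the function names, the weighting terms `alpha` and `beta`, the GAIL-style log-sigmoid imitation reward, and the REINFORCE-style update are all assumptions.

```python
# Hypothetical sketch of a unified reinforcement + adversarial imitation
# objective in the spirit of RIL. Names and weights are illustrative
# assumptions, not the paper's actual formulation.
import torch
import torch.nn.functional as F

def ril_reward(discriminator_logit: torch.Tensor,
               task_reward: torch.Tensor,
               alpha: float = 1.0,
               beta: float = 1.0) -> torch.Tensor:
    """Blend two per-response signals:
    - imitation: log D(student output), i.e. how "teacher-like" the
      LLM-based discriminator finds the student's text (GAIL-style);
    - reinforcement: an external task/verifier reward.
    """
    imitation = F.logsigmoid(discriminator_logit)  # log D(x), in (-inf, 0]
    return alpha * imitation + beta * task_reward

def policy_gradient_loss(logprobs: torch.Tensor,
                         rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate: raise log-probs of high-reward samples.
    `logprobs` is the student's summed token log-probability per sample;
    rewards are baselined by the batch mean before weighting."""
    advantages = rewards - rewards.mean()
    return -(logprobs * advantages.detach()).mean()

# Toy usage with random stand-ins for a batch of 4 sampled responses.
discriminator_logits = torch.randn(4)            # D's real/fake logits
verifier_rewards = torch.rand(4)                 # task-level rewards
student_logprobs = torch.randn(4, requires_grad=True)

loss = policy_gradient_loss(
    student_logprobs,
    ril_reward(discriminator_logits, verifier_rewards),
)
loss.backward()  # gradients flow into the student's log-probs
```

In this reading, the discriminator plays the adversarial-imitation role (pulling student outputs toward the teacher distribution) while the task reward provides the reinforcement signal the abstract credits with systematically improving generation; the relative weighting between the two would be a key design choice.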