Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

Abstract

In 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential for enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. We first provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. We then summarize the improvements RFT has brought to the reasoning capability of MLLMs in five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks, and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. A summary of work on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.
