Fine-Tuning Multimodal LLMs with ART (Art-based Reinforcement Training)

Abstract

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, whereas Soft Prompting adds fine-tuning-specific raw tokens to the LLM input. However, both require modifying the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on precompiled computational graphs. It relies on backpropagating gradients into a plain pixel array and therefore supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artwork. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and on several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
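
The central mechanism, updating only a pixel array while every model weight stays frozen, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a tiny fixed linear classifier in place of the MLLM (the names `W_IMG`, `W_TXT`, `forward`, and `loss_and_pixel_grad` are hypothetical), uses cross-entropy as one example objective, and computes the gradient by hand so that only the Python standard library is needed. A real setup would instead backpropagate through the frozen MLLM with an autodiff framework.

```python
import math
import random

random.seed(0)

N_PIXELS, N_TEXT, N_CLASSES = 16, 4, 3


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Frozen stand-in "model" (an assumption, not an MLLM):
# logits = W_IMG @ pixels + W_TXT @ text. These weights are never updated.
W_IMG = rand_matrix(N_CLASSES, N_PIXELS)
W_TXT = rand_matrix(N_CLASSES, N_TEXT)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pixels, text):
    """Class probabilities of the frozen model for one (image, text) pair."""
    return softmax([
        sum(w * p for w, p in zip(W_IMG[k], pixels))
        + sum(w * t for w, t in zip(W_TXT[k], text))
        for k in range(N_CLASSES)
    ])


def loss_and_pixel_grad(pixels, batch):
    """Mean cross-entropy over the batch and its gradient w.r.t. the pixels only."""
    loss = 0.0
    grad = [0.0] * N_PIXELS
    for text, label in batch:
        probs = forward(pixels, text)
        loss -= math.log(probs[label])
        for k in range(N_CLASSES):
            # d(cross-entropy)/d(logit_k) = softmax_k - one_hot_k
            d_logit = probs[k] - (1.0 if k == label else 0.0)
            for i in range(N_PIXELS):
                grad[i] += d_logit * W_IMG[k][i]
    n = len(batch)
    return loss / n, [g / n for g in grad]


# Toy task: every text prompt should be answered with class 0.
batch = [([random.uniform(-1.0, 1.0) for _ in range(N_TEXT)], 0) for _ in range(8)]

# The only trainable object is the image, shared across all prompts.
pixels = [0.5] * N_PIXELS
initial_loss, _ = loss_and_pixel_grad(pixels, batch)

LEARNING_RATE = 0.02
for _ in range(500):
    _, grad = loss_and_pixel_grad(pixels, batch)
    # Gradient step, then clip so the result remains a valid image in [0, 1].
    pixels = [min(1.0, max(0.0, p - LEARNING_RATE * g)) for p, g in zip(pixels, grad)]

final_loss, _ = loss_and_pixel_grad(pixels, batch)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Because the learned object is an ordinary image, it can be passed to the unmodified model like any other visual input at inference time, which is the property the abstract relies on for compatibility with precompiled computational graphs.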