Fine-Tuning Multimodal Large Language Models with ART: Art-based Reinforcement Training

Abstract

Two Parameter-Efficient Fine-Tuning (PEFT) techniques dominate for Large Language Models (LLMs). Low-Rank Adaptation (LoRA) adds trainable low-rank weights to the LLM layers, while Soft Prompting adds learned, fine-tuning-specific tokens to the LLM input. Both, however, require modifying the computational graph of a precompiled, preoptimized LLM, so neither is fully supported in high-throughput engines such as vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, which enables the soft-token approach on precompiled computational graphs. Because it relies only on backpropagating gradients into a plain pixel array, it supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as a task-relevant computational artwork. We confirm the approach's effectiveness on several sizes of the popular open Qwen architecture and on several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA on mathematics and structured-tool-use benchmarks.
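The core mechanism described above, keeping the model frozen and backpropagating a loss into a plain pixel array, can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the "model" is a tiny random two-layer network standing in for a frozen MLLM, the objective is a plain cross-entropy on one target class, and the names `loss_and_pixel_grad` and `optimize_pixels`, the sizes, and the learning rate are hypothetical. A real setup would use an autodiff framework and an actual vision-language model.

```python
import math
import random

random.seed(0)
N_PIXELS, N_HIDDEN, N_CLASSES = 16, 8, 3  # toy sizes (hypothetical)

# Frozen stand-in "model": logits = W2 @ tanh(W1 @ pixels). Never updated.
W1 = [[random.gauss(0.0, 0.5) for _ in range(N_PIXELS)] for _ in range(N_HIDDEN)]
W2 = [[random.gauss(0.0, 0.5) for _ in range(N_HIDDEN)] for _ in range(N_CLASSES)]
FROZEN_COPY = ([row[:] for row in W1], [row[:] for row in W2])


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def loss_and_pixel_grad(pixels, target):
    """Cross-entropy loss and its gradient with respect to the pixels only."""
    hidden = [math.tanh(z) for z in matvec(W1, pixels)]
    logits = matvec(W2, hidden)
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[target])
    # Backward pass: d_loss/d_logits = softmax - one_hot(target).
    d_logits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    d_hidden = [sum(W2[k][j] * d_logits[k] for k in range(N_CLASSES))
                for j in range(N_HIDDEN)]
    d_pre = [dh * (1.0 - h * h) for dh, h in zip(d_hidden, hidden)]
    d_pixels = [sum(W1[j][i] * d_pre[j] for j in range(N_HIDDEN))
                for i in range(N_PIXELS)]
    return loss, d_pixels


def optimize_pixels(target, steps=500, lr=0.02):
    """Gradient descent on the pixel array; the weights are left untouched."""
    pixels = [0.5] * N_PIXELS  # start from a mid-gray "image"
    initial_loss, _ = loss_and_pixel_grad(pixels, target)
    for _ in range(steps):
        _, grad = loss_and_pixel_grad(pixels, target)
        # Step against the gradient, then clamp to the valid pixel range [0, 1].
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    final_loss, _ = loss_and_pixel_grad(pixels, target)
    return pixels, initial_loss, final_loss


pixels, initial_loss, final_loss = optimize_pixels(target=1)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The only trainable quantity here is `pixels`; `W1` and `W2` are read but never written, which mirrors the idea that the model's computational graph stays exactly as it was compiled. The stylization of the optimized input into artwork is not covered by this sketch.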