
Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

June 10, 2026
Authors: Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski
cs.AI

Abstract

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, whereas Soft Prompting adds fine-tuning-specific raw tokens to the LLM input. However, both require modifying the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines such as vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on precompiled computational graphs. It relies on backpropagating gradients into a plain pixel array and therefore supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artwork. The approach's effectiveness is confirmed across several sizes of the popular open Qwen architecture and on several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA on mathematics and structured-tool-use benchmarks.
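
The core idea, optimizing only the input pixels while every model weight stays frozen, can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the "model" here is a hypothetical fixed linear map from 16 pixels to 3 class logits, the gradient is derived by hand for softmax cross-entropy, and all names (`frozen_model`, `optimize_pixels`, `W`) are assumptions made for illustration. A real setup would presumably backpropagate through the MLLM's vision encoder and language layers with an autodiff framework.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def frozen_model(W, pixels):
    """Toy stand-in for a frozen model: fixed linear map from pixels to logits."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in W]


def loss_and_pixel_grad(W, pixels, target):
    """Cross-entropy loss and its gradient with respect to the pixels only."""
    probs = softmax(frozen_model(W, pixels))
    loss = -math.log(probs[target])
    # d(loss)/d(logits) = probs - one_hot(target)
    err = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Chain rule through the linear map: d(loss)/d(pixels) = W^T err
    grad = [sum(err[k] * W[k][j] for k in range(len(W)))
            for j in range(len(pixels))]
    return loss, grad


def optimize_pixels(W, pixels, target, steps=500, lr=0.04):
    """Gradient descent on the pixel array; W is read but never updated."""
    pixels = list(pixels)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_pixel_grad(W, pixels, target)
        history.append(loss)
        # Step, then clip so the result remains a valid image in [0, 1].
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels, history


rng = random.Random(0)
NUM_CLASSES, NUM_PIXELS = 3, 16  # hypothetical toy sizes
W = [[rng.uniform(-1.0, 1.0) for _ in range(NUM_PIXELS)]
     for _ in range(NUM_CLASSES)]
W_before = [row[:] for row in W]  # snapshot to confirm the weights stay frozen

start = [0.5] * NUM_PIXELS  # uniform gray starting image
tuned, history = optimize_pixels(W, start, target=2)
print(f"loss at first step: {history[0]:.4f}, at last step: {history[-1]:.4f}")
```

Because the update touches only `pixels`, the same frozen weights can be served unchanged and the tuned image passed in as an ordinary visual input, which is the property the abstract emphasizes for precompiled computational graphs.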