Fine-Tuning Multimodal LLMs with ART: Art-based Reinforcement Training

Abstract

There are two main parameter-efficient fine-tuning (PEFT) techniques for large language models (LLMs). Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, whereas soft prompting adds fine-tuning-specific raw tokens to the LLM input. Both, however, require modifying the computational graphs of precompiled, preoptimized LLMs, so neither is fully supported in high-throughput engines such as vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen multimodal large language model (MLLM) by optimizing only its raw visual input, which enables the soft-token approach on precompiled computational graphs. Because it relies on backpropagating gradients into a plain pixel array, it supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. We confirm the effectiveness of the approach on several sizes of a popular open Qwen architecture and on several textual benchmarks. In particular, ART reaches accuracy competitive with LoRA on mathematics and structured tool-use benchmarks.
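
The core idea of optimizing only the input of a frozen model can be illustrated with a toy sketch. This is not the ART implementation: a small fixed linear classifier stands in for the frozen MLLM, the gradient is derived by hand instead of by automatic differentiation, and all names (`optimize_pixels`, `frozen_weights`, the 8x8 image size, the learning rate) are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def optimize_pixels(weights, target, steps=500, lr=0.01, seed=0):
    """Gradient descent on the pixel array only; `weights` is never updated."""
    rng = np.random.default_rng(seed)
    pixels = rng.uniform(0.0, 1.0, size=weights.shape[1])
    losses = []
    for _ in range(steps):
        probs = softmax(weights @ pixels)
        losses.append(float(-np.log(probs[target])))
        # Cross-entropy gradient w.r.t. the logits is (probs - onehot(target));
        # the chain rule through the linear map gives the gradient w.r.t. pixels.
        grad_logits = probs.copy()
        grad_logits[target] -= 1.0
        grad_pixels = weights.T @ grad_logits
        # Step on the pixels and keep them in the valid intensity range [0, 1].
        pixels = np.clip(pixels - lr * grad_pixels, 0.0, 1.0)
    return pixels, losses

# Toy stand-in for a frozen model: 4 output classes over a flattened 8x8 "image".
frozen_weights = np.random.default_rng(42).normal(size=(4, 64))
weights_before = frozen_weights.copy()

image, losses = optimize_pixels(frozen_weights, target=2)
print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.4f}")
```

In this sketch the only trainable quantity is `image`; the model weights are read but never written, which mirrors the property that lets the approach run on an unmodified, precompiled computational graph. In a real setting the hand-derived gradient would be replaced by backpropagation through the MLLM's vision and language stack.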