Fine-Tuning Multimodal LLMs with ART: Art-based Reinforcement Training

Abstract

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, whereas Soft Prompting adds fine-tuning-specific raw tokens to the LLM input. Both, however, require modifying the computational graph of a precompiled, preoptimized LLM, so neither is fully supported in high-throughput engines such as vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, which enables the soft-token approach on precompiled computational graphs. Because it relies on backpropagating gradients into a plain pixel array, it supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artwork. The approach's effectiveness is confirmed for different sizes of the popular open Qwen architecture and on several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA on mathematics and structured-tool-use benchmarks.
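The core idea of optimizing only the visual input of a frozen model can be illustrated with a minimal sketch. This is not the authors' implementation: a tiny linear softmax classifier stands in for the frozen MLLM, the gradient with respect to the pixels is derived by hand rather than by an autograd framework, and every name (`make_frozen_model`, `optimize_pixels`, the sizes, the learning rate) is a hypothetical choice made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_frozen_model(n_classes, n_pixels, n_text, seed=0):
    """Toy stand-in for a frozen MLLM: fixed random weights, never updated."""
    rng = random.Random(seed)
    w_img = [[rng.uniform(-1, 1) for _ in range(n_pixels)] for _ in range(n_classes)]
    w_txt = [[rng.uniform(-1, 1) for _ in range(n_text)] for _ in range(n_classes)]
    return w_img, w_txt


def forward(model, pixels, text):
    """Logits depend on both the visual input and the (fixed) text features."""
    w_img, w_txt = model
    return [
        sum(w * p for w, p in zip(wi, pixels)) + sum(w * t for w, t in zip(wt, text))
        for wi, wt in zip(w_img, w_txt)
    ]


def loss_and_pixel_grad(model, pixels, text, target):
    """Cross-entropy loss and its gradient with respect to the pixels only."""
    probs = softmax(forward(model, pixels, text))
    loss = -math.log(probs[target])
    # d(loss)/d(logit_k) = p_k - 1[k == target]
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    w_img = model[0]
    # Chain rule through the linear map: grad_j = sum_k dlogits_k * w_img[k][j]
    grad = [
        sum(dlogits[k] * w_img[k][j] for k in range(len(w_img)))
        for j in range(len(pixels))
    ]
    return loss, grad


def optimize_pixels(model, text, target, n_pixels, steps=200, lr=0.05):
    """Gradient descent on the pixel array; the model weights are left untouched."""
    pixels = [0.5] * n_pixels  # start from a uniform gray "image"
    history = []
    for _ in range(steps):
        loss, grad = loss_and_pixel_grad(model, pixels, text, target)
        history.append(loss)
        # Update, then clip so the result remains a valid image in [0, 1].
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels, history


if __name__ == "__main__":
    model = make_frozen_model(n_classes=3, n_pixels=16, n_text=4)
    pixels, history = optimize_pixels(model, [1.0, 0.0, -1.0, 0.5], target=2, n_pixels=16)
    print(f"loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

In this sketch the only trainable state is the pixel list, so the "fine-tuned" artifact is an ordinary image that could be passed to an unmodified inference graph. A real setup would presumably replace the toy model with an actual MLLM and obtain the pixel gradient through automatic differentiation.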