Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

October 10, 2025
作者: Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
cs.AI

Abstract

Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.