
Controlling Multimodal LLMs via Reward-guided Decoding

August 15, 2025
Authors: Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal
cs.AI

Abstract

As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
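The abstract describes the decoding procedure only at a high level. The sketch below illustrates what reward-guided decoding with two reward models could look like in practice, assuming hypothetical interfaces `mllm_step`, `precision_reward`, and `recall_reward` and illustrative parameters `alpha` and `beam_width`; it is not the paper's actual implementation.

```python
# Minimal sketch of reward-guided decoding with two reward models.
# All interfaces and defaults here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GuidedDecoderConfig:
    alpha: float = 0.5    # relative weight of precision vs. recall reward
    beam_width: int = 4   # breadth of the search (test-time compute knob)
    max_steps: int = 64   # maximum number of generated tokens
    eos_id: int = 2       # hypothetical end-of-sequence token id


def guided_decode(
    mllm_step: Callable[[List[int]], List[Tuple[int, float]]],  # (token, log_prob) candidates
    precision_reward: Callable[[List[int]], float],
    recall_reward: Callable[[List[int]], float],
    prompt_ids: List[int],
    cfg: GuidedDecoderConfig,
) -> List[int]:
    """Greedily extend the sequence, re-ranking candidate tokens with rewards."""
    seq = list(prompt_ids)
    for _ in range(cfg.max_steps):
        # Propose the top `beam_width` next tokens from the MLLM.
        candidates = mllm_step(seq)[: cfg.beam_width]

        def total_score(cand: Tuple[int, float]) -> float:
            token, log_p = cand
            continuation = seq + [token]
            # Weighted combination of the two visual-grounding rewards.
            reward = (cfg.alpha * precision_reward(continuation)
                      + (1.0 - cfg.alpha) * recall_reward(continuation))
            return log_p + reward

        best_token, _ = max(candidates, key=total_score)
        seq.append(best_token)
        if best_token == cfg.eos_id:
            break
    return seq
```

In this sketch, `alpha` plays the role of the user-facing knob that trades object precision for recall, while `beam_width` controls how many candidate continuations are scored per step, i.e., how much test-time compute is spent on visual grounding.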