Controlling Multimodal LLMs via Reward-guided Decoding

Abstract

As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
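The two control knobs described in the abstract, the relative weight of the precision and recall rewards and the breadth of the search, can be illustrated with a toy sketch. The abstract does not spell out the algorithm, so everything here is an assumption made for illustration: the weighted-sum form of the combined reward, the hypothetical names `guided_decode`, `propose`, `r_precision`, and `r_recall`, and the hand-written stand-ins for the MLLM and the learned reward models.

```python
from typing import Callable, List, Set

# Toy stand-ins (assumptions): objects actually in the image, and the
# object vocabulary the toy "reward models" can recognise in a caption.
IMAGE_OBJECTS: Set[str] = {"dog", "ball"}
KNOWN_OBJECTS: Set[str] = {"dog", "ball", "cat", "tree"}


def mentioned(text: str) -> Set[str]:
    """Return the set of known objects mentioned in a caption."""
    return {word for word in text.split() if word in KNOWN_OBJECTS}


def r_precision(text: str) -> float:
    """Fraction of mentioned objects that are in the image (1.0 if none mentioned)."""
    objs = mentioned(text)
    return len(objs & IMAGE_OBJECTS) / len(objs) if objs else 1.0


def r_recall(text: str) -> float:
    """Fraction of image objects that the caption mentions."""
    return len(mentioned(text) & IMAGE_OBJECTS) / len(IMAGE_OBJECTS)


def propose(prefix: str, k: int) -> List[str]:
    """Hypothetical candidate generator standing in for sampling from an MLLM."""
    segments = ["dog", "dog cat ball", "cat", "tree"]
    return [(prefix + " " + seg).strip() for seg in segments[:k]]


def guided_decode(
    propose_fn: Callable[[str, int], List[str]],
    precision_fn: Callable[[str], float],
    recall_fn: Callable[[str], float],
    w: float,
    k: int,
    max_steps: int,
) -> str:
    """Greedy reward-guided search.

    w is the weight on precision (1 - w goes to recall); k is the number of
    candidates scored per step, i.e. the breadth of the search.
    """
    text = ""
    for _ in range(max_steps):
        candidates = propose_fn(text, k)
        if not candidates:
            break
        # Keep the candidate with the highest weighted reward.
        text = max(
            candidates,
            key=lambda c: w * precision_fn(c) + (1.0 - w) * recall_fn(c),
        )
    return text


if __name__ == "__main__":
    for w, k in [(1.0, 4), (0.0, 4), (0.0, 1)]:
        caption = guided_decode(propose, r_precision, r_recall, w, k, max_steps=1)
        print(f"w={w}, k={k}: {caption!r}")
```

In this toy setting, putting all the weight on precision favors the caption that mentions only objects present in the image, while putting all the weight on recall favors the caption that covers more of them at the cost of a hallucinated object. Narrowing the search to a single candidate removes the choice entirely, which mirrors the trade-off between test-time compute and grounding.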
